<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ webscraping  - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ webscraping  - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 26 Jun 2026 22:47:31 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/webscraping/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Web Scraping With RSelenium (Chrome Driver) and Rvest ]]>
                </title>
                <description>
                    <![CDATA[ Web scraping lets you automatically extract data from websites, so you can store it in a structured format for later use. In this article, you'll explore how to use popular R libraries for web scraping to extract data from a website. The target websi... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/web-scraping-with-rselenium-chrome-driver-and-rvest/</link>
                <guid isPermaLink="false">67d8272af45871e3e821d5fa</guid>
                
                    <category>
                        <![CDATA[ Rselenium ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RVest ]]>
                    </category>
                
                    <category>
                        <![CDATA[ selenium ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ webscraping  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ chromedriver ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Elabonga Atuo ]]>
                </dc:creator>
                <pubDate>Mon, 17 Mar 2025 13:44:10 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742219025681/47c07711-cfa5-482f-a72b-d127bc5b63bc.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Web scraping lets you automatically extract data from websites, so you can store it in a structured format for later use.</p>
<p>In this article, you'll explore how to use popular R libraries for web scraping to extract data from a website. The target website displays different books across multiple pages, requiring navigation between them. You'll learn how to use RVest for data extraction and RSelenium to automate button clicks.</p>
<p>There are a couple of housekeeping rules when it comes to harvesting data on the internet:</p>
<ul>
<li><p><strong>Inspect the robots.txt file</strong>: Check the robots.txt file of a website to understand what data you are allowed to extract. You can find this file by appending “/robots.txt” to the website's home URL.</p>
</li>
<li><p><strong>Review terms and conditions</strong>: Before scraping, read the website's terms and conditions to understand the legal expectations regarding data extraction.</p>
</li>
<li><p><strong>Limit requests</strong>: Avoid overloading the server with requests by implementing rate limiting. The <a target="_blank" href="https://dmi3kno.github.io/polite/">polite</a> library in R can help manage request rates effectively.</p>
</li>
</ul>
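<p>As a small illustration of the first rule, the sketch below builds the robots.txt address from a home URL using base R. The helper name <code>robots_url()</code> is made up for this example. The commented-out lines need an internet connection, the <code>readLines()</code> call may fail if a site publishes no robots.txt, and the last two lines assume the polite package is installed.</p>
<pre><code class="lang-r"># hypothetical helper: append "/robots.txt" to a site's home URL
robots_url &lt;- function(home_url) {
  # drop any trailing slashes so the result never contains "//robots.txt"
  paste0(sub("/+$", "", home_url), "/robots.txt")
}

robots_url("https://books.toscrape.com/")

# not run here: needs network access
# readLines(robots_url("https://books.toscrape.com/"))

# not run here: polite reads robots.txt and applies a delay between requests
# session &lt;- polite::bow("https://books.toscrape.com/", delay = 5)
# page &lt;- polite::scrape(session)
</code></pre>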
<p>Let’s dive in!</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-project-overview">Project Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-understand-and-inspect-a-webpage">How to Understand and Inspect a Webpage</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-extract-data-using-rvest">How to Extract Data Using RVest</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-mimic-human-behaviour-using-rselenium">How to Mimic Human Behaviour Using RSelenium</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-combine-rselenium-amp-rvest-and-save-to-csv">How to Combine RSelenium &amp; RVest and Save to CSV</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-bringing-it-all-together">Bringing it All Together</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-project-overview">Project Overview</h2>
<p>Here’s what we’re going to be building:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739891904874/e10f91f5-f5ba-4a9d-82d7-bd297b409b1b.gif" alt="e10f91f5-f5ba-4a9d-82d7-bd297b409b1b" class="image--center mx-auto" width="800" height="450" loading="lazy"></p>
<p>This approach to web scraping allows you to see the browser in action as it navigates and extracts data from the website. Unlike headless browsing, where everything runs in the background without a visible interface, this method provides a graphical UI, making it easier to monitor and debug the process.</p>
<p>To practice your data mining skills, you will be scraping data from a website built specifically for that: <a target="_blank" href="https://books.toscrape.com/">Books To Scrape</a>. You are going to be using a driver to drive a browser which will then open your target website. It’ll navigate from the first page, mimicking human behaviour (clicking the next button) while collecting data about the books, right to the last page.</p>
<h2 id="heading-project-setup">Project Setup</h2>
<h3 id="heading-prerequisites"><strong>Prerequisites:</strong></h3>
<p>To follow along with this tutorial, you will need:</p>
<ul>
<li><p>R programming knowledge</p>
</li>
<li><p>HTML knowledge</p>
</li>
<li><p>RStudio installed</p>
</li>
</ul>
<p>Note that I’m building this tutorial on a Windows machine.</p>
<h3 id="heading-setup-and-install-chrome-driver">Setup and Install Chrome Driver</h3>
<p>First, you’ll want to check to make sure you have Java installed on your computer by running this terminal command:</p>
<pre><code class="lang-bash">java -version
</code></pre>
<p>If it’s not present, download and install Java <a target="_blank" href="https://www.java.com/en/download/">here</a>.</p>
<p>Next, install the Chrome browser if you don’t already have it. Once it’s installed, check for your browser version in the settings section.</p>
<p>Then download the ChromeDriver release that matches your browser version <a target="_blank" href="https://developer.chrome.com/docs/chromedriver/downloads/version-selection">here</a>. To check where browser drivers are stored on your device, run the following in the RStudio console:</p>
<pre><code class="lang-r"><span class="hljs-comment"># install and load wdman and binman packages</span>
install.packages(<span class="hljs-string">"wdman"</span>)
<span class="hljs-keyword">library</span>(wdman)

install.packages(<span class="hljs-string">"binman"</span>)
<span class="hljs-keyword">library</span>(binman)

<span class="hljs-comment"># check drivers already installed</span>
binman::list_versions(appname = <span class="hljs-string">"chromedriver"</span>)

<span class="hljs-comment"># check browser driver locations</span>
wdman::selenium(retcommand = <span class="hljs-literal">TRUE</span>, check = <span class="hljs-literal">FALSE</span>)
</code></pre>
<p>Extract the driver “.exe” file and store it in the folder location reported above. This is usually the following location:</p>
<pre><code class="lang-bash"><span class="hljs-string">"C:\Users\YourName\AppData\Local\binman\binman_chromedriver\win32\version\chromedriver.exe"</span>
</code></pre>
<p>Now, add the drivers to your system path by specifying the folder path excluding the application. Confirm installation by running the following terminal command.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Chromedriver SYSTEMS PATH: "C:\Users\YourName\AppData\Local\binman\binman_chromedriver\win32\version\"</span>
<span class="hljs-comment"># check chromedriver installation</span>
chromedriver -version
</code></pre>
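<p>The same check can be made from inside R. This is a minimal sketch using base R's <code>Sys.which()</code>, which returns an empty string when the program is not on the system path. If you have just edited the path, you may need to restart RStudio before the change is visible.</p>
<pre><code class="lang-r"># look up the driver on the system path from R
driver_path &lt;- unname(Sys.which("chromedriver"))

if (nzchar(driver_path)) {
  message("chromedriver found at: ", driver_path)
} else {
  message("chromedriver is not on the system path yet")
}
</code></pre>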
<h2 id="heading-how-to-understand-and-inspect-a-webpage">How to Understand and Inspect a Webpage</h2>
<p>A webpage is a visual representation of an HTML document that is available on the internet and accessed through a web browser. The components of a webpage, called elements, are structured hierarchically in an HTML DOM (Document Object Model) tree. Each element can be located using specific paths called selectors or locators, which you can read more about <a target="_blank" href="https://testrigor.com/blog/css-selector-vs-xpath-your-pocket-cheat-sheet/">here</a>.</p>
<p>Developer Tools are a set of tools available in your browser. They’re helpful for inspecting and analyzing a webpage’s structure. The “Inspect” feature helps you examine the structure and styling of a specific element. You can access it by right-clicking the element you would like to inspect and clicking “Inspect”.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739974770342/59c960b1-2c88-4c1d-a23d-d9e9fee91dc5.gif" alt="Inspecting an element" class="image--center mx-auto" width="1366" height="728" loading="lazy"></p>
<h2 id="heading-how-to-extract-data-using-rvest">How to Extract Data Using RVest</h2>
<p>RVest is an R package with a set of functions that let you extract data from HTML and XML web pages.</p>
<p>We are interested in extracting the following information about books from every page on the website’s catalogue:</p>
<ul>
<li><p>Book Title</p>
</li>
<li><p>Book Rating</p>
</li>
<li><p>Book Price</p>
</li>
<li><p>Individual Book Link</p>
</li>
<li><p>Cover Image Link</p>
</li>
</ul>
<p>Let’s go through the steps for using RVest to extract this data.</p>
<h3 id="heading-step-1-load-the-webpage"><strong>Step 1: Load the webpage</strong></h3>
<p>To load the first page of your target website and parse the HTML document using the RVest package in R, follow these steps:</p>
<ol>
<li><p><strong>Install and load the RVest package</strong>: If you haven't already installed the RVest package, you can do so by running the following command in R:</p>
<pre><code class="lang-r"> install.packages(<span class="hljs-string">"rvest"</span>)
</code></pre>
<p> Then, load the package:</p>
<pre><code class="lang-r"> <span class="hljs-keyword">library</span>(rvest)
</code></pre>
</li>
<li><p><strong>Load the webpage and parse the HTML</strong>: Use the <code>read_html()</code> function from the RVest package to fetch and parse the HTML content of the webpage. Here's an example of how to do this:</p>
<pre><code class="lang-r"> <span class="hljs-comment"># Specify the URL of the target website</span>
 url &lt;- <span class="hljs-string">"https://books.toscrape.com/"</span>

 <span class="hljs-comment"># Fetch and parse the HTML content</span>
 webpage &lt;- read_html(url)
</code></pre>
</li>
</ol>
<p>This code will download the HTML content of the specified webpage and convert it into an XML document, making it easier to structure and organize the data for further processing or storage.</p>
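<p>To see what kind of object the parser produces without touching the network, here is a small sketch. It uses <code>minimal_html()</code>, the helper in the rvest package for parsing an HTML string. The markup is made up for illustration and is not copied from the target site.</p>
<pre><code class="lang-r">library(rvest)

# offline stand-in for a downloaded page (made-up markup)
doc &lt;- minimal_html("&lt;h1&gt;All products&lt;/h1&gt;&lt;p&gt;1000 results&lt;/p&gt;")

# the parsed page is an xml_document
class(doc)

# elements can now be looked up with CSS selectors
doc %&gt;% html_element("h1") %&gt;% html_text2()
</code></pre>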
<h3 id="heading-step-2-identify-the-target-elements"><strong>Step 2: Identify the target elements</strong></h3>
<p>The target elements are the HTML elements that contain the specific data you intend to extract.</p>
<p>A quick inspection of the webpage using developer tools shows that each book’s information is contained in an <code>article</code> tag and forms part of an ordered list. It’s important to specify the <code>&lt;ol&gt;</code> tag in the path, as there are other lists in the tree.</p>
<p>The pipe <code>%&gt;%</code> operator facilitates chaining operations, making it easier to extract elements step by step. <code>html_element()</code> returns the first matching element while <code>html_elements()</code> returns all the elements that match the defined path.</p>
<pre><code class="lang-r"><span class="hljs-comment"># define the path from which other details will be extracted</span>
book &lt;- webpage %&gt;% html_element(<span class="hljs-string">"ol"</span>)  %&gt;% html_elements(<span class="hljs-string">"li"</span>) %&gt;% html_element(<span class="hljs-string">"article"</span>)

<span class="hljs-comment"># extracting details using css locators.</span>
<span class="hljs-comment"># title</span>
title &lt;- book %&gt;% 
  html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
  html_attr(<span class="hljs-string">"title"</span>)

<span class="hljs-comment"># rating</span>
rating &lt;- book %&gt;% 
  html_element(<span class="hljs-string">"p"</span>) %&gt;% 
  html_attr(<span class="hljs-string">"class"</span>)

<span class="hljs-comment"># price</span>
price &lt;- book %&gt;% 
  html_element(<span class="hljs-string">".product_price p"</span>) %&gt;% 
  html_text2()

<span class="hljs-comment">#link to book page</span>
book_link &lt;- book %&gt;% 
  html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
  html_attr(<span class="hljs-string">"href"</span>)

<span class="hljs-comment"># cover page image link</span>
cover_page_link &lt;- book %&gt;% 
  html_element(<span class="hljs-string">".image_container a img"</span>) %&gt;% 
  html_attr(<span class="hljs-string">"src"</span>)

<span class="hljs-comment"># inspect right format by selecting the first element of each detail</span>
title[[<span class="hljs-number">1</span>]]
rating[[<span class="hljs-number">1</span>]]
price[[<span class="hljs-number">1</span>]]
book_link[[<span class="hljs-number">1</span>]]
cover_page_link[[<span class="hljs-number">1</span>]]
</code></pre>
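<p>The difference between <code>html_element()</code> and <code>html_elements()</code> is easier to see on a tiny document. The sketch below parses a made-up snippet that imitates two entries of the book list (the titles and links are invented) and runs the same selector chain on it.</p>
<pre><code class="lang-r">library(rvest)

# made-up markup imitating two entries of the book list
snippet &lt;- minimal_html('
  &lt;ol&gt;
    &lt;li&gt;&lt;article&gt;
      &lt;p class="star-rating Three"&gt;&lt;/p&gt;
      &lt;h3&gt;&lt;a href="book-a/index.html" title="Book A"&gt;Book A&lt;/a&gt;&lt;/h3&gt;
    &lt;/article&gt;&lt;/li&gt;
    &lt;li&gt;&lt;article&gt;
      &lt;p class="star-rating One"&gt;&lt;/p&gt;
      &lt;h3&gt;&lt;a href="book-b/index.html" title="Book B"&gt;Book B&lt;/a&gt;&lt;/h3&gt;
    &lt;/article&gt;&lt;/li&gt;
  &lt;/ol&gt;')

# html_elements() keeps every &lt;li&gt;; html_element() then picks one &lt;article&gt; per &lt;li&gt;
articles &lt;- snippet %&gt;% html_element("ol") %&gt;% html_elements("li") %&gt;% html_element("article")
length(articles)

# one value per article, in document order
titles &lt;- articles %&gt;% html_element("h3 a") %&gt;% html_attr("title")
classes &lt;- articles %&gt;% html_element("p") %&gt;% html_attr("class")
titles
classes

# on the whole document, html_element() returns only the first match
snippet %&gt;% html_element("h3 a") %&gt;% html_attr("title")
</code></pre>
<p>Here <code>articles</code> has two entries, so every detail extracted from it is a vector with one value per book.</p>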
<h3 id="heading-step-3-clean-the-rating-data"><strong>Step 3: Clean the “rating” data</strong></h3>
<p>To clean the "star-rating" data, you can use the <code>stringr</code> package in R to remove the unnecessary text and trim any whitespace. Here's how you can do it:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(stringr)

<span class="hljs-comment"># Example of extracted rating data</span>
rating_data &lt;- <span class="hljs-string">"star-rating Three"</span>

<span class="hljs-comment"># Remove "star-rating " and trim whitespace</span>
cleaned_rating &lt;- str_trim(str_replace(rating_data, <span class="hljs-string">"star-rating "</span>, <span class="hljs-string">""</span>))

<span class="hljs-comment"># Output the cleaned rating</span>
cleaned_rating
</code></pre>
<p>This code will output "Three", effectively removing the "star-rating" prefix and any leading or trailing whitespace.</p>
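<p>The same cleaning works on a whole vector of ratings at once. The sketch below uses base R only. The final step, which maps the rating words to integers, is an optional addition that is not part of the original steps.</p>
<pre><code class="lang-r"># example class values in the shape html_attr("class") returns
rating &lt;- c("star-rating Three", "star-rating One", "star-rating Five")

# sub() is vectorised, so every rating is cleaned in one call
cleaned_rating &lt;- trimws(sub("star-rating", "", rating, fixed = TRUE))

# optional: map the words to the integers 1 to 5
rating_levels &lt;- c("One", "Two", "Three", "Four", "Five")
rating_number &lt;- match(cleaned_rating, rating_levels)

data.frame(cleaned_rating, rating_number)
</code></pre>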
<h2 id="heading-how-to-mimic-human-behaviour-using-rselenium">How to Mimic Human Behaviour Using RSelenium</h2>
<h3 id="heading-how-selenium-works"><strong>How Selenium Works</strong></h3>
<p>Selenium is a tool that allows you to simulate user actions on a website, usually for testing purposes. RSelenium is an R library that allows you to access this functionality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739961235501/f358a1e1-6a2f-45dd-a0b0-12925811cab1.png" alt="Diagram illustrating Selenium's architecture. It shows a client with a Selenium script communicating with a server's browser driver using JSON Wire Protocol over HTTP. The server then sends a HTTP request to a browser" class="image--center mx-auto" width="1490" height="573" loading="lazy"></p>
<p>We need a script, a browser, and a browser driver to mimic user behaviour. The script is the code you write: it contains the instructions detailing the actions you would like to automate. The browser driver acts as a bridge between your script and the browser, translating the script's instructions into actions the browser performs.</p>
<p>The script, when run, is the client which requests and receives info from the browser driver’s server.</p>
<p>When you run a script, it is converted to JSON format data, which is then transferred to the browser driver via the JSON Wire Protocol. A protocol is simply a set of rules that define how data should be managed and handled during transfer across devices.</p>
<p>The driver receives and validates the received data. If successful, it communicates the actions defined in the script to the browser. If it’s unsuccessful, an error is sent to the client.</p>
<p>On browser initialization, the driver performs the actions step by step. This carries on to completion or until an error is encountered (missing elements, server errors, and so on). The bidirectional communication between the driver and browser is via HTTP. Finally, the results are sent back to the client and the browser is shut down.</p>
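<p>To make the “converted to JSON” step more concrete, here is a rough sketch that assumes the jsonlite package is installed. It serialises the two arguments of a find-element command the way a client might before sending them to the driver. It illustrates the idea only and is not a capture of the traffic RSelenium actually sends.</p>
<pre><code class="lang-r">library(jsonlite)

# the arguments of a find-element command: a locator strategy and a selector
find_next &lt;- list(using = "css selector", value = ".next &gt; a")

# serialise them into a JSON body
payload &lt;- toJSON(find_next, auto_unbox = TRUE)
payload

# the receiving side parses the JSON back into the same fields
parsed &lt;- fromJSON(payload)
parsed$using
parsed$value
</code></pre>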
<h3 id="heading-automating-page-navigation-and-data-collection-with-rselenium">Automating Page Navigation and Data Collection with RSelenium</h3>
<pre><code class="lang-r"><span class="hljs-comment"># install and load RSelenium</span>
install.packages(<span class="hljs-string">"RSelenium"</span>)
<span class="hljs-keyword">library</span>(RSelenium)

<span class="hljs-comment"># initialize and run the chrome driver</span>
rD &lt;- rsDriver(browser = <span class="hljs-string">"chrome"</span>, port = <span class="hljs-number">4567L</span>)

<span class="hljs-comment"># extract and assign the client</span>
remDr &lt;- rD[[<span class="hljs-string">"client"</span>]]
</code></pre>
<p>Running <code>rsDriver()</code> starts a Selenium server that launches ChromeDriver. Extract and assign the <code>rD[["client"]]</code> to a variable. This variable allows you to control and interact with the browser.</p>
<p>Sometimes, starting the driver may fail due to reasons such as permission restrictions, missing dependencies, or incorrect setup. If that happens, you can manually launch ChromeDriver by adding the following block of code right after loading the libraries at the top of the script. It is important to ensure the port numbers match.</p>
<pre><code class="lang-r">cDrv &lt;- chrome(verbose = <span class="hljs-literal">FALSE</span>, check = <span class="hljs-literal">FALSE</span>, port = <span class="hljs-number">4567L</span>)
cDrv$process
</code></pre>
<p>Now, navigate to the target webpage:</p>
<pre><code class="lang-r"><span class="hljs-comment"># navigate to the target site</span>
remDr$navigate(<span class="hljs-string">"https://books.toscrape.com/"</span>)

<span class="hljs-comment">#maximize Chrome Window Size</span>
remDr$maxWindowSize()
</code></pre>
<p>And scroll to the bottom of the page:</p>
<pre><code class="lang-r"><span class="hljs-comment"># scroll to the bottom of the page</span>
webElem &lt;- remDr$findElement(<span class="hljs-string">"css"</span>, <span class="hljs-string">"body"</span>)
webElem$sendKeysToElement(list(key = <span class="hljs-string">"end"</span>))
</code></pre>
<p>The above code locates the body element and simulates pressing the End key, which scrolls to the bottom of the page.</p>
<p>Now, click Next to navigate to the next page:</p>
<pre><code class="lang-r"><span class="hljs-comment"># locate next button and click next</span>
nextPage &lt;-  remDr$findElement(using = <span class="hljs-string">"css selector"</span>,
                               value = <span class="hljs-string">".next &gt; a"</span>)
nextPage$clickElement()
</code></pre>
<p>Find the element that contains the link to the next page and click on it to redirect you.</p>
<p>Now we’re going to write a while loop that navigates through all the pages, up to page 50, and then closes the browser once it’s done.</p>
<p>A while loop executes a piece of code as long as a specific condition is met. Once the condition is not met, the loop exits.</p>
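<pre><code class="lang-r"><span class="hljs-comment"># a while loop repeats its body for as long as the condition is TRUE</span>
page &lt;- <span class="hljs-number">1</span>
<span class="hljs-keyword">while</span> (page &lt;= <span class="hljs-number">3</span>) {
    print(page)  <span class="hljs-comment"># do something</span>
    page &lt;- page + <span class="hljs-number">1</span>
}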
</code></pre>
<p>Write a loop that ensures the next page button is clicked as long as the element containing the link to the next page is visible in the HTML DOM.</p>
<p>First, locate the next button element. Its presence in the open webpage makes sure that the loop runs.</p>
<p>The last page does not have a next button, so the loop will exit when it reaches that page (and Selenium will throw an error due to the missing element).</p>
<pre><code class="lang-r">nextPage &lt;- remDr$findElement(using = <span class="hljs-string">"css selector"</span>, value = <span class="hljs-string">".next &gt; a"</span>)
</code></pre>
<p>Wrap the nextPage element search in a <code>tryCatch()</code> block. This prevents the script from crashing if the 'Next' button is missing. If an error occurs, <code>tryCatch()</code> returns <code>NULL</code>, signaling that there are no more pages to navigate.</p>
<p>An <code>if</code> block then checks for a <code>NULL</code> value. If encountered, a message is displayed to inform the client that no 'Next' button was found, and the <code>break</code> statement exits the loop.</p>
<p>Finally, close the browser once the driver navigates to the last page (page 50 in the catalogue) to free up system resources using <code>remDr$close()</code>.</p>
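<p>The <code>tryCatch()</code> and <code>NULL</code> check described above can be tried without a browser. In this sketch, <code>find_next_button()</code> is a made-up stand-in for <code>remDr$findElement()</code>: it raises an error on the last page, just as Selenium does when the element is missing. The three-page limit is arbitrary.</p>
<pre><code class="lang-r"># stand-in for remDr$findElement(): errors on the last page
find_next_button &lt;- function(current_page, last_page = 3) {
  if (current_page &gt;= last_page) stop("no such element")
  current_page + 1
}

current_page &lt;- 1
visited &lt;- c()

while (TRUE) {
  visited &lt;- c(visited, current_page)

  # turn the "missing element" error into NULL
  next_page &lt;- tryCatch(
    find_next_button(current_page),
    error = function(e) NULL
  )

  if (is.null(next_page)) {
    message("No 'Next' button found. Exiting loop.")
    break
  }

  current_page &lt;- next_page  # stands in for nextPage$clickElement()
}

visited
</code></pre>
<p>The loop visits pages 1, 2 and 3 and then exits cleanly instead of stopping with an error.</p>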
<pre><code class="lang-r">
<span class="hljs-keyword">while</span> (<span class="hljs-literal">TRUE</span>) {  
  <span class="hljs-comment"># Try to find and click "Next" button</span>
  nextPage &lt;- <span class="hljs-keyword">tryCatch</span>({
    remDr$findElement(using = <span class="hljs-string">"css selector"</span>, value = <span class="hljs-string">".next &gt; a"</span>)
  }, error = <span class="hljs-keyword">function</span>(e) {
    <span class="hljs-keyword">return</span>(<span class="hljs-literal">NULL</span>)  <span class="hljs-comment"># No more pages</span>
  })

  <span class="hljs-keyword">if</span> (is.null(nextPage)) {
    message(<span class="hljs-string">"No 'Next' button found. Exiting loop."</span>)
    <span class="hljs-keyword">break</span>
  }

  nextPage$clickElement()
  Sys.sleep(<span class="hljs-number">3</span>)  <span class="hljs-comment"># Allow next page to load</span>

}
print(<span class="hljs-string">"finished scraping"</span>)
remDr$close()
</code></pre>
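<p>A fixed pause of three seconds is simple, but it waits the full time even when the page loads quickly. One possible alternative, sketched below, is to poll a condition until it becomes true or a timeout expires. The helper <code>wait_until()</code> is hypothetical and written for this example. It is not part of RSelenium.</p>
<pre><code class="lang-r"># hypothetical helper: poll a condition until it is TRUE or the timeout expires
wait_until &lt;- function(condition, timeout = 10, interval = 0.5) {
  deadline &lt;- Sys.time() + timeout
  repeat {
    # treat errors raised by the condition as "not ready yet"
    ready &lt;- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ready) return(TRUE)
    if (Sys.time() &gt;= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# usage sketch (needs a running browser session, so it is commented out):
# previous_url &lt;- remDr$getCurrentUrl()[[1]]
# nextPage$clickElement()
# wait_until(function() remDr$getCurrentUrl()[[1]] != previous_url)
</code></pre>
<p>The function returns <code>TRUE</code> as soon as the condition holds and <code>FALSE</code> if the timeout is reached first, so the caller can decide whether to retry or stop.</p>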
<h2 id="heading-how-to-combine-rselenium-amp-rvest-and-save-to-csv">How to Combine RSelenium &amp; RVest and Save to CSV</h2>
<p>Now that we’ve extracted data from specific HTML elements using RVest and automated user actions using RSelenium, let’s combine the two to scrape data from all the pages in the website.</p>
<h3 id="heading-create-a-scrape-books-function"><strong>Create a scrape books function</strong></h3>
<p>You will be saving the scraped books information in a CSV file. First, create an empty dataframe to hold the scraped data:</p>
<pre><code class="lang-r"><span class="hljs-comment"># install and load dplyr for dataframe manipulation</span>
install.packages(<span class="hljs-string">"dplyr"</span>)
<span class="hljs-keyword">library</span>(dplyr)

<span class="hljs-comment"># create a dataframe to hold book information</span>
Books &lt;-  data.frame()
</code></pre>
<h3 id="heading-retrieve-and-parse-the-webpage">Retrieve and parse the webpage</h3>
<p>For RVest to work with RSelenium, you first have to retrieve the HTML source of the page currently loaded in the Selenium-controlled browser. <code>remDr$getPageSource()</code> returns a list, and <code>[[1]]</code> extracts the HTML content from it:</p>
<pre><code class="lang-r">page &lt;- remDr$getPageSource()[[<span class="hljs-number">1</span>]]
</code></pre>
<p>Convert the HTML content to XML using <code>read_html()</code> like this:</p>
<pre><code class="lang-r"> <span class="hljs-comment"># define the path from which other details will be extracted</span>
    book &lt;- read_html(page)  %&gt;% html_element(<span class="hljs-string">"ol"</span>)  %&gt;% html_elements(<span class="hljs-string">"li"</span>) %&gt;% html_element(<span class="hljs-string">"article"</span>)
</code></pre>
<p>Extract each book’s details using CSS selectors with <code>rvest</code> functions. <code>html_attr()</code> and <code>html_text2()</code> return character vectors, so piping <code>as.character()</code> at the very end of each extracted detail is a safeguard: it makes the type explicit and helps prevent unexpected data type issues when working with the data.</p>
<pre><code class="lang-r">    <span class="hljs-comment"># title</span>
    title &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"title"</span>) %&gt;% 
      as.character()
</code></pre>
<p>Wrap the block of code used to extract details from HTML elements in a function and return a dataframe whose column values are the book details. This makes the code reusable and modular.</p>
<pre><code class="lang-r">
scrape_books &lt;- <span class="hljs-keyword">function</span>() {
    page &lt;- remDr$getPageSource()[[<span class="hljs-number">1</span>]]

    <span class="hljs-comment"># define the path from which other details will be extracted</span>
    book &lt;- read_html(page)  %&gt;% html_element(<span class="hljs-string">"ol"</span>)  %&gt;% html_elements(<span class="hljs-string">"li"</span>) %&gt;% html_element(<span class="hljs-string">"article"</span>)

    <span class="hljs-comment"># extracting details using css locators.</span>
    <span class="hljs-comment"># title</span>
    title &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"title"</span>) %&gt;% 
      as.character() 

    <span class="hljs-comment"># rating</span>
    rating &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"p"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"class"</span>) %&gt;% 
      as.character() 

    cleaned_rating &lt;- str_trim(gsub(<span class="hljs-string">"star-rating"</span>, <span class="hljs-string">""</span>, rating))

    <span class="hljs-comment"># price</span>
    price &lt;- book %&gt;% 
      html_element(<span class="hljs-string">".product_price p"</span>) %&gt;% 
      html_text2() %&gt;% 
      as.character() 

    <span class="hljs-comment">#link to book page</span>
    book_link &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"href"</span>) %&gt;% 
      as.character() 

    <span class="hljs-comment"># image link</span>
    cover_page_link &lt;- book %&gt;% 
      html_element(<span class="hljs-string">".image_container a img"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"src"</span>) %&gt;% 
      as.character() 

    <span class="hljs-keyword">return</span>(data.frame(title,cleaned_rating,price,book_link,cover_page_link, stringsAsFactors = <span class="hljs-literal">FALSE</span>))
}
</code></pre>
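<p>The function returns the price and the links as raw text. If you want a numeric price and full URLs, a post-processing step along these lines may help. It is a sketch that assumes the xml2 package (installed alongside rvest) and uses made-up example values. The base URL should be the address of the page the links were scraped from.</p>
<pre><code class="lang-r">library(xml2)

# example values in the shape scrape_books() returns (made up for illustration)
price &lt;- c("\u00a351.77", "\u00a353.74")  # "\u00a3" is the pound sign
book_link &lt;- c("catalogue/book-a_1/index.html", "catalogue/book-b_2/index.html")

# keep digits and the decimal point, then convert to numeric
price_number &lt;- as.numeric(gsub("[^0-9.]", "", price))

# resolve relative links against the URL of the page they came from
page_url &lt;- "https://books.toscrape.com/index.html"
full_link &lt;- url_absolute(book_link, page_url)

data.frame(price_number, full_link)
</code></pre>
<p>In a live session, <code>page_url</code> could be taken from <code>remDr$getCurrentUrl()[[1]]</code> so that links on later pages resolve against the correct address.</p>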
<h3 id="heading-write-to-csv"><strong>Write to CSV</strong></h3>
<p>Save the dataframe to a CSV file named “books.csv”:</p>
<pre><code class="lang-r">write.csv(Books, file = <span class="hljs-string">"./books.csv"</span>, fileEncoding = <span class="hljs-string">"UTF-8"</span>)
</code></pre>
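<p>By default, <code>write.csv()</code> also writes the row numbers as an extra first column. If you would rather leave them out, <code>row.names = FALSE</code> does that. The sketch below writes a small made-up dataframe to a temporary folder and reads it back to confirm the round trip. The name <code>books_demo</code> is used only for this example.</p>
<pre><code class="lang-r"># small stand-in for the scraped data (made-up rows)
books_demo &lt;- data.frame(
  title = c("Book A", "Book B"),
  cleaned_rating = c("Three", "One"),
  stringsAsFactors = FALSE
)

# row.names = FALSE leaves out the extra column of row numbers
csv_path &lt;- file.path(tempdir(), "books.csv")
write.csv(books_demo, file = csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# read the file back to confirm it round-trips
check &lt;- read.csv(csv_path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
identical(dim(check), dim(books_demo))
names(check)
</code></pre>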
<h2 id="heading-bringing-it-all-together">Bringing it All Together</h2>
<p>Let’s review what we’ve done so far: First, the script to scrape book data begins by loading the browser, maximizing the window size, and navigating to the Books To Scrape Page.</p>
<p>Then we created an empty dataframe to hold the scraped data. We then scraped the data from the first page, saved it to the dataframe, and located the ‘Next’ button in order to navigate to the next page, from which we scraped data and stored it.</p>
<p>The process of scraping, adding to the dataframe, and clicking the next page button continues until the ‘Next’ button is no longer available in the HTML DOM.</p>
<p>Once the last page has been reached, the code exits the loop and saves the data to CSV. Finally, it closes the driver to free up system resources.</p>
<pre><code class="lang-r"><span class="hljs-comment"># load libraries</span>
<span class="hljs-keyword">library</span>(wdman)
<span class="hljs-keyword">library</span>(binman)
<span class="hljs-keyword">library</span>(rvest)
<span class="hljs-keyword">library</span>(stringr)
<span class="hljs-keyword">library</span>(RSelenium)
<span class="hljs-keyword">library</span>(dplyr)


cDrv &lt;- chrome(verbose = <span class="hljs-literal">FALSE</span>, check = <span class="hljs-literal">FALSE</span>, port = <span class="hljs-number">4450L</span>)
cDrv$process

rD &lt;- rsDriver(browser = <span class="hljs-string">"chrome"</span>, port = <span class="hljs-number">4450L</span>)
remDr &lt;- rD[[<span class="hljs-string">"client"</span>]]


remDr$navigate(<span class="hljs-string">"https://books.toscrape.com/"</span>)
remDr$maxWindowSize()

page &lt;- remDr$getPageSource()[[<span class="hljs-number">1</span>]]
webElem &lt;- remDr$findElement(<span class="hljs-string">"css"</span>, <span class="hljs-string">"body"</span>)
webElem$sendKeysToElement(list(key = <span class="hljs-string">"end"</span>))

nextPage &lt;-  remDr$findElement(using = <span class="hljs-string">"css selector"</span>,
                               value = <span class="hljs-string">".next &gt; a"</span>)
nextPage$clickElement()


<span class="hljs-comment"># converting the lists containg the scraped data into a dataframe </span>
Books &lt;-  data.frame(title = character(), rating = character(), stringsAsFactors = <span class="hljs-literal">FALSE</span>)

scrape_books &lt;- <span class="hljs-keyword">function</span>() {
    page &lt;- remDr$getPageSource()[[<span class="hljs-number">1</span>]]

    <span class="hljs-comment"># define the path from which other details will be extracted</span>
    books &lt;- read_html(page)  %&gt;% html_element(<span class="hljs-string">"ol"</span>)  %&gt;% html_elements(<span class="hljs-string">"li"</span>) %&gt;% html_element(<span class="hljs-string">"article"</span>)

    <span class="hljs-comment"># extracting details using css locators.</span>
    <span class="hljs-comment"># title</span>
    title &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"title"</span>) %&gt;% 
      as.character() 

    <span class="hljs-comment"># rating</span>
    rating &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"p"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"class"</span>) %&gt;% 
      as.character() 

    cleaned_rating &lt;- str_trim(gsub(<span class="hljs-string">"star-rating"</span>, <span class="hljs-string">""</span>, rating))

    <span class="hljs-comment"># price</span>
    price &lt;- book %&gt;% 
      html_element(<span class="hljs-string">".product_price p"</span>) %&gt;% 
      html_text2() %&gt;% 
      as.character() 

    <span class="hljs-comment">#link to book page</span>
    book_link &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"href"</span>) %&gt;% 
      as.character() 

    <span class="hljs-comment"># image link</span>
    cover_page_link &lt;- book %&gt;% 
      html_element(<span class="hljs-string">".image_container a img"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"src"</span>) %&gt;% 
      as.character() 

    <span class="hljs-keyword">return</span>(data.frame(title,cleaned_rating,price,book_link,cover_page_link, stringsAsFactors = <span class="hljs-literal">FALSE</span>))
}

<span class="hljs-comment"># scrape every page; the loop below starts with the page that is currently open</span>

<span class="hljs-keyword">while</span> (<span class="hljs-literal">TRUE</span>) {
  <span class="hljs-comment"># scrape current page</span>
  Books &lt;- rbind(Books, scrape_books())

  <span class="hljs-comment"># find and click "next" button</span>
  nextPage &lt;- <span class="hljs-keyword">tryCatch</span>({
    remDr$findElement(using = <span class="hljs-string">"css selector"</span>, value = <span class="hljs-string">".next &gt; a"</span>)
  }, error = <span class="hljs-keyword">function</span>(e) {
    <span class="hljs-keyword">return</span>(<span class="hljs-literal">NULL</span>)  <span class="hljs-comment"># No more pages</span>
  })

  <span class="hljs-comment"># exit loop if "next" button is missing</span>
  <span class="hljs-keyword">if</span> (is.null(nextPage)) {
    message(<span class="hljs-string">"No 'Next' button found. Exiting loop."</span>)
    <span class="hljs-keyword">break</span>
  }

  nextPage$clickElement()
  <span class="hljs-comment"># Allow next page to load</span>
  Sys.sleep(<span class="hljs-number">3</span>)  

}

write.csv(Books, file = <span class="hljs-string">"./books.csv"</span>, fileEncoding = <span class="hljs-string">"UTF-8"</span>)
print(<span class="hljs-string">"finished scraping"</span>)
remDr$close()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740129915080/2ee1344b-58a8-477b-a568-719ba4336c95.png" alt="2ee1344b-58a8-477b-a568-719ba4336c95" class="image--center mx-auto" width="390" height="993" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to effectively combine RSelenium and rvest to scrape data from a website. By leveraging RSelenium, you can automate user interactions and navigate through web pages, while rvest allows you to extract specific data from HTML elements.</p>
<p>This approach provides a powerful and flexible method for web scraping, enabling you to handle dynamic content and mimic human behavior. By following the steps outlined here, you can successfully scrape data from multiple pages and save it to a CSV file for further analysis.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Dynamic Web Scraper App with Playwright and React: A Step-by-Step Guide ]]>
                </title>
                <description>
                    <![CDATA[ Today we are going to build a small web scraper app. This application will scrape data from the Airbnb website and display it in a nice grid view. We will also add a Refresh button that will be able to trigger a new scraping round and update the resu... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-dynamic-web-scraper-app-with-playwright-and-react/</link>
                <guid isPermaLink="false">6787cb727ee491a35577e9d5</guid>
                
                    <category>
                        <![CDATA[ React ]]>
                    </category>
                
                    <category>
                        <![CDATA[ webscraping  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ playwright ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mihail Gaberov ]]>
                </dc:creator>
                <pubDate>Wed, 15 Jan 2025 14:51:30 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1736952641440/0aa6255b-45eb-4ae8-b5cb-d87648590e18.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Today we are going to build a small web scraper app. This application will scrape data from the Airbnb website and display it in a nice grid view. We will also add a Refresh button that will be able to trigger a new scraping round and update the results.</p>
<p>In order to make our app a bit more performant, we will utilize the browser’s local storage to store already scraped data so that we don’t trigger new scraping requests every time the browser is refreshed.</p>
<p>Here’s what it will look like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736256647942/56c176ad-2615-478a-94f8-e33b3b437b92.png" alt="Web Scraper app interface" class="image--center mx-auto" width="2516" height="1708" loading="lazy"></p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-to-spin-up-the-app-with-vite">How to Spin Up the App with Vite</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-server">How to Build the Server</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-front-end">How to Build the Front End</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-deploy-to-rendercom">How to Deploy to</a> <a target="_blank" href="http://render.com">render.com</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<p>If you want to jump straight into the code, <a target="_blank" href="https://github.com/mihailgaberov/web-scraper">here</a> <a target="_blank" href="https://orderbook-mihailgaberov.vercel.app/">💁</a> is the GitHub repository with a detailed <a target="_blank" href="https://github.com/mihailgaberov/web-scraper/blob/main/README.md">README 🙌,</a> and <a target="_blank" href="https://scraper-fe.onrender.com/">here</a> you can see the live demo.</p>
<p>Now if you’re ready, let’s go step by step and see how to build and deploy the app.</p>
<p>First, let’s get everything set up and ready to go.</p>
<h2 id="heading-how-to-spin-up-the-app-with-vite">How to Spin Up the App with Vite</h2>
<p>We will use the <a target="_blank" href="https://vite.dev/">Vite</a> build tool to quickly spin up a bare bones React application, equipped with TailwindCSS for styling. In order to do that, run this in your terminal app:</p>
<pre><code class="lang-bash">npm create vite@latest web-scraper -- --template react
</code></pre>
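<p>And then install and configure TailwindCSS as follows. The version is pinned to v3 because the <code>init</code> command used below was removed in Tailwind CSS v4:</p>
<pre><code class="lang-bash">npm install -D tailwindcss@3 postcss autoprefixer
npx tailwindcss init -p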
</code></pre>
<p>Add the paths to all of your template files in your <code>tailwind.config.js</code> file like this:</p>
<pre><code class="lang-javascript"><span class="hljs-comment">/** <span class="hljs-doctag">@type <span class="hljs-type">{import('tailwindcss').Config}</span> </span>*/</span>
<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> {
  <span class="hljs-attr">content</span>: [
    <span class="hljs-string">"./index.html"</span>,
    <span class="hljs-string">"./src/**/*.{js,ts,jsx,tsx}"</span>,
  ],
  <span class="hljs-attr">theme</span>: {
    <span class="hljs-attr">extend</span>: {},
  },
  <span class="hljs-attr">plugins</span>: [],
}
</code></pre>
<p>Now you should have a brand new React application with Tailwind installed and configured.</p>
<p>Let’s start our work with the server.</p>
<h2 id="heading-how-to-build-the-server">How to Build the Server</h2>
<p>Since we are building a full stack application, the bare minimum we need to have in place is a server, a client, and an API. The API will live in the server world and the client app will call the endpoints it exposes in order to fetch the data we need to display on the front end.</p>
<h3 id="heading-set-up-the-http-server-with-expressjs">Set Up the HTTP Server with Express.js</h3>
<p>We are going to use the Express.js library to spin up an HTTP server that will handle our API requests. To do so, follow these steps:</p>
<p>First, install the necessary packages with this command:</p>
<pre><code class="lang-bash">npm install express cors playwright
</code></pre>
<p>Then create an empty <code>server.js</code> file in the project's root folder and add the following code:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> express <span class="hljs-keyword">from</span> <span class="hljs-string">"express"</span>;
<span class="hljs-keyword">import</span> { chromium } <span class="hljs-keyword">from</span> <span class="hljs-string">"playwright"</span>;
<span class="hljs-keyword">import</span> cors <span class="hljs-keyword">from</span> <span class="hljs-string">"cors"</span>;
<span class="hljs-keyword">import</span> { scrapeListings } <span class="hljs-keyword">from</span> <span class="hljs-string">"./utils/scraper.js"</span>;

<span class="hljs-keyword">const</span> app = express();
<span class="hljs-keyword">const</span> PORT = <span class="hljs-number">5001</span>;

app.use(cors());

app.get(<span class="hljs-string">"/scrape"</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">let</span> browser;
  <span class="hljs-keyword">try</span> {
    browser = <span class="hljs-keyword">await</span> chromium.launch();
    <span class="hljs-keyword">const</span> listings = <span class="hljs-keyword">await</span> scrapeListings({ browser, <span class="hljs-attr">retryCount</span>: <span class="hljs-number">0</span> });
    res.json(listings);
  } <span class="hljs-keyword">catch</span> (error) {
    res.status(<span class="hljs-number">500</span>).json({ <span class="hljs-attr">error</span>: error.message });
  }
});

app.listen(PORT, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Scraper server running on http://localhost:<span class="hljs-subst">${PORT}</span>`</span>);
});
</code></pre>
<p>Before we continue with the scraper, let me first explain what we are doing here.</p>
<p>This is a pretty simple setup of an Express server that exposes an endpoint called <code>/scrape</code>. Our client-side application (the front end) can send GET requests to this endpoint and receive the data returned as a result.</p>
<p>What’s important here is the async callback function that we pass to the <code>app.get</code> method. This is where we call our scraping function within a try/catch block. It will return the scraped data or an error if something goes wrong.</p>
<p>The last few lines indicate that our server will listen on the specified PORT, which is set to 5001 here, and display a message in the terminal to show that the server is running.</p>
<h3 id="heading-what-is-web-scraping">What is Web Scraping?</h3>
<p>Before diving into the code, I want to briefly explain web scraping if you’re unfamiliar with it. Web scraping involves automatically reading content from websites using a piece of software. This software is called a “web scraper”. In our case, the scraper is what’s in the <code>scrapeListings</code> function.</p>
<p>An essential part of the scraping process is finding something in the DOM tree of the target website that you can use to select the data you want to scrape. This something is known as a <strong>selector</strong>. A selector can match elements by tag name (h3, p, table) or by attribute, such as a class name or an ID.</p>
<p>How you implement the selecting part depends on the language and library you use to build the scraper, and a careful choice of selectors gives more reliable results.</p>
<p>In our case, we’re using <code>[itemprop="itemListElement"]</code> as the selector. But you might wonder, how did we figure this out and decide to use that? How do you know which selector to use?</p>
<p>This is where it gets tricky. You have to manually inspect the DOM tree of the target website and determine what would work best. That’s the case unless the site provides an API specifically designed for scrapers.</p>
<p>Here is what this looks like in practice. This is a screenshot from the Airbnb website:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736588995918/fe2eb87f-e1cb-4474-894f-169fffb8a216.png" alt="Finding a selector from Airbnb website DOM tree" class="image--center mx-auto" width="2348" height="1258" loading="lazy"></p>
<p>Usually, you’ll need the information you’re scraping for some particular purpose, which means you’ll need to store it somewhere and then process it. This processing often involves some kind of visualization of the data. This is where our client application comes into play.</p>
<p>We will store the results of our scraping in the browser's local storage. Then, we will easily display those results in a grid layout using React and TailwindCSS. But before we get to all this, let’s go back to the code to understand how the scraping process is done.</p>
<h3 id="heading-set-up-playwright">Set Up Playwright</h3>
<p>For the scraping functionality, we will use another library that has gotten pretty famous over the last few years: <a target="_blank" href="https://playwright.dev/">Playwright</a>. It mainly serves as an e2e testing solution, but, as you’ll see now, we can use it for scraping the web as well.</p>
<p>We will put the scraping function in a separate file so that we have it all in order and keep the separation of concerns in place.</p>
<p>Create a new folder in the root directory and name it utils. Inside this folder, add a new file named scraper.js and include the following code:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> MAX_RETRIES = <span class="hljs-number">3</span>;

<span class="hljs-keyword">const</span> validateListing = <span class="hljs-function">(<span class="hljs-params">listing</span>) =&gt;</span> {
  <span class="hljs-keyword">return</span> (
    <span class="hljs-keyword">typeof</span> listing.title === <span class="hljs-string">"string"</span> &amp;&amp;
    <span class="hljs-keyword">typeof</span> listing.price === <span class="hljs-string">"string"</span> &amp;&amp;
    <span class="hljs-keyword">typeof</span> listing.link === <span class="hljs-string">"string"</span>
  );
};

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> scrapeListings = <span class="hljs-keyword">async</span> ({ browser, retryCount }) =&gt; {
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> page = <span class="hljs-keyword">await</span> browser.newPage();

    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">await</span> page.goto(<span class="hljs-string">"https://www.airbnb.com/"</span>, { <span class="hljs-attr">waitUntil</span>: <span class="hljs-string">"load"</span> });

      <span class="hljs-keyword">await</span> page.waitForSelector(<span class="hljs-string">'[itemprop="itemListElement"]'</span>, {
        <span class="hljs-attr">timeout</span>: <span class="hljs-number">10000</span>,
      });

      <span class="hljs-keyword">const</span> listings = <span class="hljs-keyword">await</span> page.$$eval(
        <span class="hljs-string">'[itemprop="itemListElement"]'</span>,
        <span class="hljs-function">(<span class="hljs-params">elements</span>) =&gt;</span> {
          <span class="hljs-keyword">return</span> elements.slice(<span class="hljs-number">0</span>, <span class="hljs-number">10</span>).map(<span class="hljs-function">(<span class="hljs-params">element</span>) =&gt;</span> {
            <span class="hljs-keyword">const</span> title =
              element.querySelector(<span class="hljs-string">".t1jojoys"</span>)?.innerText || <span class="hljs-string">"N/A"</span>;
            <span class="hljs-keyword">const</span> price =
              element.querySelector(<span class="hljs-string">"._11jcbg2"</span>)?.innerText || <span class="hljs-string">"N/A"</span>;
            <span class="hljs-keyword">const</span> link = element.querySelector(<span class="hljs-string">"a"</span>)?.href || <span class="hljs-string">"N/A"</span>;
            <span class="hljs-keyword">return</span> { title, price, link };
          });
        }
      );

      <span class="hljs-keyword">const</span> validListings = listings.filter(validateListing);

      <span class="hljs-keyword">if</span> (validListings.length === <span class="hljs-number">0</span>) {
        <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No listings found"</span>);
      }

      <span class="hljs-keyword">return</span> validListings;
    } <span class="hljs-keyword">catch</span> (pageError) {
      <span class="hljs-keyword">if</span> (retryCount &lt; MAX_RETRIES) {
        <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Retrying... (<span class="hljs-subst">${retryCount + <span class="hljs-number">1</span>}</span>/<span class="hljs-subst">${MAX_RETRIES}</span>)`</span>);
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> scrapeListings({ browser, <span class="hljs-attr">retryCount</span>: retryCount + <span class="hljs-number">1</span> });
      } <span class="hljs-keyword">else</span> {
        <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(
          <span class="hljs-string">`Failed to scrape data after <span class="hljs-subst">${MAX_RETRIES}</span> attempts: <span class="hljs-subst">${pageError.message}</span>`</span>
        );
      }
    } <span class="hljs-keyword">finally</span> {
      <span class="hljs-keyword">await</span> page.close();
    }
  } <span class="hljs-keyword">catch</span> (browserError) {
    <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">`Failed to launch browser: <span class="hljs-subst">${browserError.message}</span>`</span>);
  } <span class="hljs-keyword">finally</span> {
    <span class="hljs-keyword">if</span> (browser) {
      <span class="hljs-keyword">await</span> browser.close();
    }
  }
};
</code></pre>
<h3 id="heading-retry-mechanism">Retry Mechanism</h3>
<p>At the top of the file, there's a constant called <code>MAX_RETRIES</code> used to implement a <strong>retry mechanism</strong>. This tactic is often used by web scrapers to recover from temporary failures, such as slow page loads or the protections that some websites have against scraping. We will see how to use it below.</p>
<p>It's important to mention the legal aspect here as well. Always respect the terms and conditions along with the privacy policy of the website you plan to scrape. Use these techniques only to handle technical challenges, not to break the law.</p>
<p>A small helper function, <code>validateListing</code>, follows. It checks that the title, price, and link of a listing are all strings.</p>
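<p>Keep in mind that the scraper falls back to the string "N/A" when a selector matches nothing, so a type check alone lets placeholder rows through. As a minimal sketch, a stricter check could look like the one below. The names <code>isFilled</code>, <code>isCompleteListing</code>, and <code>PLACEHOLDER</code> are hypothetical and not part of the original project.</p>
<pre><code class="lang-javascript">// Hypothetical stricter variant of validateListing (not from the original repo).
// It rejects listings where any field is missing, blank, or the "N/A" placeholder.
const PLACEHOLDER = "N/A";

function isFilled(value) {
  if (typeof value !== "string") {
    return false;
  }
  const trimmed = value.trim();
  if (trimmed === "") {
    return false;
  }
  return trimmed !== PLACEHOLDER;
}

function isCompleteListing(listing) {
  return ["title", "price", "link"].every(function (field) {
    return isFilled(listing[field]);
  });
}

const sample = [
  { title: "Cabin by the lake", price: "$120 night", link: "https://example.com/rooms/1" },
  { title: "N/A", price: "$95 night", link: "https://example.com/rooms/2" },
];

console.log(sample.filter(isCompleteListing).length); // 1
</code></pre>
<p>Whether to drop incomplete rows or keep them with a placeholder is a design choice. Dropping them keeps the grid clean, while keeping them makes it easier to notice that a selector stopped matching.</p>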
<p>Next is the main scraping function. We pass it a single object that holds the browser object, provided by Playwright, and the number of retries made so far.</p>
<p>There are two try/catch blocks to handle possible failures: an outer one for problems with the browser itself (it runs in headless mode, meaning you won't see a window) and an inner one for the scraping process. In the latter, we use Playwright to request the website, wait until the page has loaded, and then wait for the selector we defined to appear.</p>
<p>In the callback function we pass to <code>$$eval</code>, we access the elements returned by the scraping, allowing us to process them and get the data we want. In this case, I’m using three selectors to fetch the title, price, and link of the property. The first two are class names, and the last is the HTML tag <code>&lt;a&gt;</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736589673793/79cd8e05-eef7-4887-b3cb-6d2d95e9d5dc.png" alt="Selecting the price" class="image--center mx-auto" width="2360" height="1244" loading="lazy"></p>
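<p>Generated class names such as <code>.t1jojoys</code> can change without notice, and then the scraper silently returns "N/A". One way to soften that, sketched here under assumptions, is to try several selectors in order and keep the first one that yields text. The helper name <code>firstText</code> and the <code>data-testid</code> selector are hypothetical. In the real scraper the helper would have to be defined inside the <code>$$eval</code> callback, because that callback runs in the page and cannot see functions from the Node.js file.</p>
<pre><code class="lang-javascript">// Hypothetical helper: return the text of the first selector that matches
// something non-empty, or the fallback value when none of them do.
function firstText(element, selectors, fallback) {
  for (const selector of selectors) {
    const match = element.querySelector(selector);
    const text = match ? String(match.innerText || "").trim() : "";
    if (text !== "") {
      return text;
    }
  }
  return fallback;
}

// Stand-in for a DOM element so the sketch also runs in plain Node.js.
const fakeCard = {
  querySelector: function (selector) {
    const nodes = {
      '[data-testid="listing-card-title"]': { innerText: " Cozy loft " },
    };
    return nodes[selector] || null;
  },
};

const title = firstText(
  fakeCard,
  [".t1jojoys", '[data-testid="listing-card-title"]'],
  "N/A"
);
console.log(title); // Cozy loft
</code></pre>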
<p>Then we return an object, { title, price, link }, that holds the fetched values of the three properties. At the end of the try block, we validate the results before returning them to the front end.</p>
<p>What follows in the catch part is the implementation of the retry mechanism we talked about a minute ago:</p>
<pre><code class="lang-javascript"> } <span class="hljs-keyword">catch</span> (pageError) {
      <span class="hljs-keyword">if</span> (retryCount &lt; MAX_RETRIES) {
        <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Retrying... (<span class="hljs-subst">${retryCount + <span class="hljs-number">1</span>}</span>/<span class="hljs-subst">${MAX_RETRIES}</span>)`</span>);
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> scrapeListings({ browser, <span class="hljs-attr">retryCount</span>: retryCount + <span class="hljs-number">1</span> });
      } <span class="hljs-keyword">else</span> {
        <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(
          <span class="hljs-string">`Failed to scrape data after <span class="hljs-subst">${MAX_RETRIES}</span> attempts: <span class="hljs-subst">${pageError.message}</span>`</span>
        );
      }
    }
</code></pre>
<p>If an error occurs during the reading process, we enter the catch phase and check if the retry count is below the maximum limit we set. If it is, we try again by running the function recursively. Otherwise, we throw an error indicating that the scraping failed and the maximum retry attempts have been reached.</p>
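<p>The same idea can be written as a loop instead of recursion, which keeps the attempt counter in one place and makes it easy to pause between attempts. This is only a sketch: <code>withRetries</code>, <code>sleep</code>, <code>maxRetries</code>, and <code>delayMs</code> are hypothetical names, and the task passed in stands for whatever scraping step you want to repeat.</p>
<pre><code class="lang-javascript">// Hypothetical generic retry helper (not part of the original project).
function sleep(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

// Runs an async task once and retries it up to maxRetries times on failure.
async function withRetries(task, maxRetries, delayMs) {
  if (!Number.isInteger(maxRetries) || Math.sign(maxRetries) === -1) {
    throw new TypeError("maxRetries must be a non-negative integer");
  }
  let lastError;
  for (let attempt = 0; attempt !== maxRetries + 1; attempt += 1) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt !== maxRetries) {
        console.log(`Retrying... (${attempt + 1}/${maxRetries})`);
        await sleep(delayMs * (attempt + 1)); // wait a little longer each time
      }
    }
  }
  throw new Error(`Failed after ${maxRetries} retries: ${lastError.message}`);
}

// Usage: a task that fails twice and then succeeds on the third call.
let calls = 0;
withRetries(async function () {
  calls += 1;
  if (calls !== 3) {
    throw new Error("temporary failure");
  }
  return "ok";
}, 3, 10).then(function (result) {
  console.log(result, calls); // ok 3
});
</code></pre>
<p>Pausing between attempts also spreads the requests out, which is gentler on the target website than retrying immediately.</p>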
<p>That's all you need to set up basic web scraping of the Airbnb homepage.</p>
<p>You can see all this in the <a target="_blank" href="https://github.com/mihailgaberov/web-scraper">GitHub repo</a> of the project so don’t worry if you missed something here.</p>
<h2 id="heading-how-to-build-the-front-end">How to Build the Front End</h2>
<p>Now it's time to put the scraped data to use.</p>
<p>Let's display the first 10 properties returned by the scraper in a grid layout, allowing you (or anyone) to open them by clicking on their links. We will also add a <code>Refresh</code> feature that lets you perform a new scrape to get the most up-to-date data.</p>
<p>This is how the structure of the front end part of the project looks:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736591529574/5cf5c62f-283b-4cfc-afa4-3b2b92a3ddae.png" alt="Project structure" class="image--center mx-auto" width="410" height="452" loading="lazy"></p>
<p>We have a simple app structure: one main container (App.jsx) that holds all the components and includes some logic for making requests to the API and storing the data in local storage.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> { useEffect, useState } <span class="hljs-keyword">from</span> <span class="hljs-string">"react"</span>;
<span class="hljs-keyword">import</span> { useLocalStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">"@uidotdev/usehooks"</span>;
<span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">"axios"</span>;
<span class="hljs-keyword">import</span> Footer <span class="hljs-keyword">from</span> <span class="hljs-string">"./components/Footer"</span>;
<span class="hljs-keyword">import</span> Header <span class="hljs-keyword">from</span> <span class="hljs-string">"./components/Header"</span>;
<span class="hljs-keyword">import</span> RefreshButton <span class="hljs-keyword">from</span> <span class="hljs-string">"./components/RefreshButton"</span>;
<span class="hljs-keyword">import</span> Grid <span class="hljs-keyword">from</span> <span class="hljs-string">"./components/Grid"</span>;
<span class="hljs-keyword">import</span> Loader <span class="hljs-keyword">from</span> <span class="hljs-string">"./components/Loader"</span>;

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">App</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> [listings, setListings] = useLocalStorage(<span class="hljs-string">"properties"</span>, []);
  <span class="hljs-keyword">const</span> [loading, setLoading] = useState(<span class="hljs-literal">false</span>);
  <span class="hljs-keyword">const</span> [error, setError] = useState(<span class="hljs-string">""</span>);

  <span class="hljs-keyword">const</span> fetchListings = <span class="hljs-keyword">async</span> () =&gt; {
    setLoading(<span class="hljs-literal">true</span>);
    setError(<span class="hljs-string">""</span>);
    setListings([]);

    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> axios.get(<span class="hljs-string">"http://localhost:5001/scrape"</span>);
      <span class="hljs-keyword">if</span> (response.data.length === <span class="hljs-number">0</span>) {
        <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"No listings found"</span>);
      }
      setListings(response.data);
    } <span class="hljs-keyword">catch</span> (err) {
      setError(
        err.response?.data?.error ||
          <span class="hljs-string">"Failed to fetch listings. Please try again."</span>
      );
    } <span class="hljs-keyword">finally</span> {
      setLoading(<span class="hljs-literal">false</span>);
    }
  };

  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-keyword">if</span> (listings.length === <span class="hljs-number">0</span>) {
      fetchListings();
    }
  }, []);

  <span class="hljs-keyword">return</span> (
    <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">className</span>=<span class="hljs-string">"flex flex-col items-center justify-center min-h-screen bg-gray-100"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">Header</span> /&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">RefreshButton</span> <span class="hljs-attr">callback</span>=<span class="hljs-string">{fetchListings}</span> <span class="hljs-attr">loading</span>=<span class="hljs-string">{loading}</span> /&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">main</span> <span class="hljs-attr">className</span>=<span class="hljs-string">"flex flex-col items-center justify-center flex-1 w-full px-4 relative"</span>&gt;</span>
        {error &amp;&amp; <span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">className</span>=<span class="hljs-string">"text-red-500"</span>&gt;</span>{error}<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>}
        {loading ? <span class="hljs-tag">&lt;<span class="hljs-name">Loader</span> /&gt;</span> : <span class="hljs-tag">&lt;<span class="hljs-name">Grid</span> <span class="hljs-attr">listings</span>=<span class="hljs-string">{listings}</span> /&gt;</span>}
      <span class="hljs-tag">&lt;/<span class="hljs-name">main</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">Footer</span> /&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span></span>
  );
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> App;
</code></pre>
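<p>Conceptually, persisting the listings with <code>useLocalStorage</code> comes down to reading and writing JSON under a key. The following sketch shows that caching logic without React, assuming a storage object with the same <code>getItem</code> and <code>setItem</code> methods as <code>window.localStorage</code>. The names <code>readCache</code>, <code>loadListings</code>, <code>fakeStorage</code>, and <code>fakeFetch</code> are hypothetical and exist only so the example runs outside the browser.</p>
<pre><code class="lang-javascript">// Hypothetical cache helpers (not part of the original project).
function readCache(storage, key) {
  const raw = storage.getItem(key);
  if (raw === null) {
    return [];
  }
  try {
    const parsed = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [];
  } catch (error) {
    return []; // corrupted entry: behave as if nothing was cached
  }
}

// Return cached listings when present; otherwise fetch and remember them.
async function loadListings(storage, key, fetchListings) {
  const cached = readCache(storage, key);
  if (cached.length !== 0) {
    return cached;
  }
  const fresh = await fetchListings();
  storage.setItem(key, JSON.stringify(fresh));
  return fresh;
}

// In-memory stand-in for window.localStorage so the sketch runs in Node.js.
const memory = new Map();
const fakeStorage = {
  getItem: function (key) {
    return memory.has(key) ? memory.get(key) : null;
  },
  setItem: function (key, value) {
    memory.set(key, String(value));
  },
};

// Stand-in for the request to the /scrape endpoint.
let requests = 0;
async function fakeFetch() {
  requests += 1;
  return [{ title: "Cozy loft", price: "$80 night", link: "https://example.com/rooms/1" }];
}

loadListings(fakeStorage, "properties", fakeFetch)
  .then(function () {
    return loadListings(fakeStorage, "properties", fakeFetch);
  })
  .then(function (listings) {
    console.log(listings.length, requests); // 1 1
  });
</code></pre>
<p>The second call is served from the cache, so only one request is made. This mirrors the check on <code>listings.length</code> inside <code>useEffect</code>.</p>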
<p>All components are placed in the <a target="_blank" href="https://github.com/mihailgaberov/web-scraper/tree/main/src/components">components</a> directory (this is what I call a surprise, ah:)). Most of the components are quite simple, and I included them to give the app a more complete appearance.</p>
<p>The <a target="_blank" href="https://github.com/mihailgaberov/web-scraper/blob/main/src/components/Header.jsx">Header</a> displays the top bar. The <a target="_blank" href="https://github.com/mihailgaberov/web-scraper/blob/main/src/components/RefreshButton.jsx">RefreshButton</a> is used to send a new request and get the latest data. In the <code>&lt;main&gt;</code> section, we show an error message if fetching fails, and below it either a <a target="_blank" href="https://github.com/mihailgaberov/web-scraper/blob/main/src/components/Loader.jsx">Loader</a> component while the data is loading or a <a target="_blank" href="https://github.com/mihailgaberov/web-scraper/blob/main/src/components/Grid.jsx">Grid</a> component once it is ready.</p>
<p>The loading part is straightforward. The Grid component is the interesting one. We pass the scraping results to it using a prop called 'listings'. Inside, we use a simple map() function to go through them and display the properties. We use Tailwind to style the grid, ensuring the properties are neatly listed and look good on both desktop and mobile screens.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736593771753/0f56930c-2944-452c-a2cd-4c6b69c1d996.png" alt="The app looks good on smaller screens as well." class="image--center mx-auto" width="1338" height="1732" loading="lazy"></p>
<p>And at the end we have the <a target="_blank" href="https://github.com/mihailgaberov/web-scraper/blob/main/src/components/Footer.jsx">Footer</a> component that renders a simple bar with text. Again, I’ve added it just for completeness.</p>
<h2 id="heading-how-to-deploy-to-rendercom">How to Deploy to render.com</h2>
<p>Maybe a little over a year ago, I needed a place to deploy full-stack applications, ideally for free, since they were just for educational purposes.</p>
<p>After some research, I found a platform called <a target="_blank" href="https://dashboard.render.com/">Render</a> and managed to deploy an app with both client and server parts, getting it to work online. I left it there until now. Since our scraper requires both parts to function properly, we will deploy it there and have it working online, as you can see <a target="_blank" href="https://scraper-fe.onrender.com/">here</a>.</p>
<p>To do this, you need to create an account with Render and use their dashboard application. The process is simple, but I'll include a few screenshots below for clarity.</p>
<p>This is the Overview page where you can see all your projects:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736594350231/eac21d82-f43d-44be-8027-6ac25c86f740.png" alt="Render platform overview page" class="image--center mx-auto" width="2494" height="784" loading="lazy"></p>
<p>Here is the Project page where you can view and manage your projects. In our case, we can see both the server and the client app as separate services.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736594435708/a4329d74-f529-46a0-a193-d6a1ade145ad.png" alt="a4329d74-f529-46a0-a193-d6a1ade145ad" class="image--center mx-auto" width="1874" height="1130" loading="lazy"></p>
<p>You can click on each service to open its page, where you can view the deployments and the commits that triggered them. You can find even more details if you explore further.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736594566165/39e78a87-d8e7-4df3-8e1b-02eb0eee799c.png" alt="39e78a87-d8e7-4df3-8e1b-02eb0eee799c" class="image--center mx-auto" width="1846" height="1690" loading="lazy"></p>
<p>You should be able to manage the deployment process on your own, as everything is clearly explained. But if you need help, feel free to reach out.</p>
<p>I should mention that I am not affiliated with Render in any way and I am not receiving any benefits for mentioning them here. I just found it to be a useful tool and wanted to share it with you – so I’ve used it here.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>A web scraper app can be a powerful tool for gathering data, but there are several areas for improvement and important considerations to keep in mind here.</p>
<p>Firstly, you can enhance the app's performance and efficiency by optimizing the scraping process and ensuring that the data is stored and processed effectively. You can also implement more robust error handling and retry mechanisms to improve the reliability of the scraper.</p>
<p>Also, keep in mind that ethical scraping is crucial, and it's important to always respect the terms of service and privacy policies of the websites you are scraping. This includes not overloading the website with requests and ensuring that the data is used responsibly. Always seek permission from the site if required and consider using APIs provided by the website as a more ethical and reliable alternative.</p>
<p>Lastly, respecting the law is paramount. Ensure that your scraping activities comply with legal regulations and guidelines to avoid any potential legal issues. By focusing on these aspects, you can build a more effective, ethical, and legally compliant web scraper app.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
