<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ web scraping - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ web scraping - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 15 Jun 2026 23:29:41 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/web-scraping/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Web Scraping for Beginners 2026 ]]>
                </title>
                <description>
                    <![CDATA[ If you have ever wanted to collect product data, monitor competitors, track SEO rankings, or build AI tools that pull information from the internet, you have likely run into the common frustrations of ]]>
                </description>
                <link>https://www.freecodecamp.org/news/web-scraping-for-beginners-2026/</link>
                <guid isPermaLink="false">6a28c9110fd9e5ca15ede211</guid>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 10 Jun 2026 02:16:49 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5f68e7df6dfc523d0a894e7c/d37db125-5538-429b-bc55-3f99005a83cd.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you have ever wanted to collect product data, monitor competitors, track SEO rankings, or build AI tools that pull information from the internet, you have likely run into the common frustrations of web scraping: broken scripts, rate limits, bot detection, and tedious CAPTCHAs.</p>
<p>We just published a new tutorial on the freeCodeCamp.org YouTube channel, featuring software developer and course creator Ania Kubow.</p>
<p>In this comprehensive, beginner-friendly course, Ania teaches you a much simpler, more efficient approach. Instead of building scrapers from scratch, you will learn how to leverage an API to handle the heavy lifting for you.</p>
<p>Throughout this tutorial, you will master the following:</p>
<ul>
<li><p>How to bypass web scraping obstacles like bot protection and rate limits using <a href="https://serpapi.com?utm_source=freecodecamp">SerpApi, the Web Search API</a>.</p>
</li>
<li><p>How to extract structured JSON data directly from search engines like Google, Amazon, YouTube, and more.</p>
</li>
<li><p>How to use the Google Lens API to scrape images and visual matches.</p>
</li>
<li><p>How to build your own functional web application that searches for and downloads content locally to your computer.</p>
</li>
</ul>
<p>By the end of this video, you will have the knowledge and the basic code necessary to turn internet data into actionable insights for your own projects.</p>
<p>Watch the full tutorial on <a href="https://youtu.be/j6hnjNhx_MM">the freeCodeCamp.org YouTube channel</a> (1-hour watch).</p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/j6hnjNhx_MM" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Traditional Scraping vs AI Scraping: A Practical Guide for Developers and Data Teams ]]>
                </title>
                <description>
                    <![CDATA[ Enormous amounts of data are constantly generated on the open web. Product prices change, job listings go live and get taken down, news articles are published, and company information gets updated. Fo ]]>
                </description>
                <link>https://www.freecodecamp.org/news/traditional-scraping-vs-ai-scraping/</link>
                <guid isPermaLink="false">69e156abffbb787634ffa61a</guid>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Joel Olawanle ]]>
                </dc:creator>
                <pubDate>Thu, 16 Apr 2026 21:37:47 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/c7f16b00-5245-48e2-a864-38a86d4292ae.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Enormous amounts of data are constantly generated on the open web. Product prices change, job listings go live and get taken down, news articles are published, and company information gets updated.</p>
<p>For developers and teams that rely on this kind of data, the question has never been whether to scrape the web, but how to do so reliably over time.</p>
<p>For a long time, the approach has been straightforward. You inspect a page, write selectors, and extract the data using tools like <a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a> or browser automation libraries like <a href="https://playwright.dev/">Playwright</a> and <a href="https://www.selenium.dev/">Selenium</a>. This works well, but it comes with a familiar problem: the moment the structure of a page changes, your scraper breaks and needs fixing.</p>
<p>Recently, a different approach has started gaining attention. Instead of writing selectors, you describe what you want and let the system figure out how to extract it. This is what people refer to as AI scraping.</p>
<p>Both approaches are widely used today, but they solve the problem in very different ways. This guide breaks down how each one works, where each one fits, and how to decide which approach makes sense for your use case.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-traditional-web-scraping">What is Traditional Web Scraping?</a></p>
</li>
<li><p><a href="#heading-traditional-scraping-in-practice">Traditional Scraping in Practice</a></p>
</li>
<li><p><a href="#heading-what-is-ai-web-scraping">What is AI Web Scraping?</a></p>
</li>
<li><p><a href="#heading-ai-scraping-in-practice">AI Scraping in Practice</a></p>
</li>
<li><p><a href="#heading-traditional-scraping-vs-ai-scraping-when-to-use-each">Traditional Scraping vs AI Scraping: When to Use Each</a></p>
</li>
</ul>
<h2 id="heading-what-is-traditional-web-scraping">What is Traditional Web Scraping?</h2>
<p><a href="https://www.freecodecamp.org/news/web-scraping-in-javascript-with-puppeteer/">Traditional web scraping</a> is built on a simple idea: if a browser can load a page and display data to a user, then a program should be able to do the same and extract that data automatically.</p>
<p>This is done with CSS selectors and XPath. For CSS selectors, a selector like <code>.product-card .price</code> means “find the price element inside a product card.” It's easy to understand and works well for most use cases.</p>
<p>XPath, on the other hand, is more powerful but more complex. It allows you to navigate the structure of a page in more detail, including moving up and down the DOM, filtering by text, or handling deeply nested elements.</p>
<p>In practice, most developers start with CSS selectors and only use XPath when the structure becomes too complex.</p>
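<p>To make the difference concrete, here is a minimal sketch using Python's built-in <code>xml.etree.ElementTree</code>, which supports a limited subset of XPath. The product cards, names, and prices are made up for illustration, and the tree is built by hand so the example runs offline. A real page would usually need a full HTML parser such as BeautifulSoup or lxml.</p>

```python
import xml.etree.ElementTree as ET

def product_card(name, price, stock):
    """Build a small element tree that stands in for one product card."""
    card = ET.Element("div", {"class": "product-card"})
    ET.SubElement(card, "h2").text = name
    ET.SubElement(card, "span", {"class": "price"}).text = price
    ET.SubElement(card, "span", {"class": "stock"}).text = stock
    return card

# Hypothetical page body with two product cards
root = ET.Element("body")
root.append(product_card("Kettle", "19.99", "In stock"))
root.append(product_card("Toaster", "34.50", "Sold out"))

# Same intent as the CSS selector ".product-card .price"
prices = [el.text for el in root.findall(".//div[@class='product-card']/span[@class='price']")]

# XPath-only moves: filter by text, then step up to the parent card with ".."
in_stock_cards = root.findall(".//span[@class='stock'][.='In stock']/..")
in_stock_names = [card.find("h2").text for card in in_stock_cards]

print(prices)
print(in_stock_names)
```

<p>The first query does what a CSS selector would do. The second one filters on text and then walks back up to the parent element, which is the kind of navigation that usually pushes developers toward XPath.</p>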
<p>This idea has been around since the early days of the web. Instead of manually copying information from a page, developers started writing scripts that send requests, receive HTML responses, and extract the pieces they care about.</p>
<p>At its core, nothing about that model has really changed.</p>
<p>You still fetch a page, inspect its structure, and extract data from it. The difference today is not the concept, but how sophisticated the tooling and scale have become.</p>
<h3 id="heading-the-tools-behind-traditional-scraping">The Tools Behind Traditional Scraping</h3>
<p>Over time, a solid ecosystem of tools has developed around this approach.</p>
<ul>
<li><p><a href="https://pypi.org/project/requests/"><strong>Requests</strong></a> is the de facto Python library for making HTTP calls. Most traditional scrapers use <code>requests</code> to fetch pages and then pass the response to BeautifulSoup for parsing. It's simple and reliable for static sites.</p>
</li>
<li><p><a href="https://pypi.org/project/beautifulsoup4/"><strong>BeautifulSoup</strong></a> is a Python library for parsing HTML and XML. It takes raw HTML and builds a navigable tree of objects from it. It's fast to learn, very readable, and excellent for static pages. Its main limitation is that it has no browser engine, so it can't execute JavaScript. If a site renders content dynamically after page load, BeautifulSoup will see an empty container.</p>
</li>
<li><p><a href="https://www.selenium.dev/"><strong>Selenium</strong></a> and <a href="https://playwright.dev/"><strong>Playwright</strong></a> are browser automation tools that control a real browser. They can click buttons, scroll, and wait for JavaScript to finish loading before extracting data. The trade-off is that they are slower and more resource-intensive than simple HTTP requests, but they are necessary for dynamic sites.</p>
</li>
</ul>
<h2 id="heading-traditional-scraping-in-practice">Traditional Scraping in Practice</h2>
<p>Let's build a real, working scraper using <a href="https://books.toscrape.com/">Books to Scrape</a>, a sandbox site built specifically for practicing web scraping. The goal is to extract the title, price, and star rating for every book listed on the first page.</p>
<h3 id="heading-step-1-install-dependencies">Step 1: Install Dependencies</h3>
<pre><code class="language-plaintext">pip install requests beautifulsoup4
</code></pre>
<h3 id="heading-step-2-inspect-the-page">Step 2: Inspect the Page</h3>
<p>Before writing a single line of code, open the target page in your browser and inspect its HTML. Right-click any book title and choose "Inspect" to see the structure.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e33f66ffc003cfb0c64069e/3b47f64a-8ac8-4cf2-9237-b117355bec5f.png" alt="Inspecting the page structure" style="display:block;margin:0 auto" width="3018" height="1654" loading="lazy">

<p>You'll notice each book lives inside an <code>&lt;article class="product_pod"&gt;</code> element, and within it:</p>
<ul>
<li><p>The title is in the <code>&lt;h3&gt;</code> tag, inside an <code>&lt;a&gt;</code> element (as a <code>title</code> attribute)</p>
</li>
<li><p>The price is in a <code>&lt;p class="price_color"&gt;</code> element</p>
</li>
<li><p>The star rating is encoded in the CSS class of a <code>&lt;p&gt;</code> element — for example, <code>&lt;p class="star-rating Three"&gt;</code> means three stars</p>
</li>
</ul>
<p>This is the core detective work of traditional scraping: you study the HTML, find the patterns, and write selectors to match them.</p>
<h3 id="heading-step-3-write-the-scraper">Step 3: Write the Scraper</h3>
<pre><code class="language-python">import requests
from bs4 import BeautifulSoup

# 1. Fetch the page
url = "https://books.toscrape.com/"
response = requests.get(url)

# Always check the request succeeded before going further
if response.status_code != 200:
    print(f"Failed to fetch page: {response.status_code}")
    # Exit with a non-zero status so callers can tell the run failed
    raise SystemExit(1)

# 2. Parse the HTML
soup = BeautifulSoup(response.content, "html.parser")

# 3. Find all book containers on the page
books = soup.select("article.product_pod")

# 4. Extract data from each book
results = []

for book in books:
    # Title is stored as an attribute, not visible text
    title = book.select_one("h3 a")["title"]

    # Price is the text inside the price element
    price = book.select_one("p.price_color").get_text(strip=True)

    # Rating is encoded as a word in the CSS class: "star-rating Three"
    # We grab the second class name and map it to a number
    rating_word = book.select_one("p.star-rating")["class"][1]
    rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    rating = rating_map.get(rating_word, 0)

    results.append({
        "title": title,
        "price": price,
        "rating": rating
    })

# 5. Display results
for book in results:
    print(f"{book['title']} | {book['price']} | {book['rating']} stars")
</code></pre>
<h3 id="heading-step-4-run-it">Step 4: Run It</h3>
<pre><code class="language-bash">python scraper.py
</code></pre>
<p>Your output will look something like this:</p>
<pre><code class="language-bash">A Light in the Attic | £51.77 | 3 stars
Tipping the Velvet | £53.74 | 1 stars
Soumission | £50.10 | 1 stars
Sharp Objects | £47.82 | 4 stars
Sapiens: A Brief History of Humankind | £54.23 | 5 stars
...
</code></pre>
<p>Twenty books, all structured and clean.</p>
<h3 id="heading-step-5-extend-it-to-multiple-pages">Step 5: Extend It to Multiple Pages</h3>
<p>The site has 50 pages. Extending the scraper to crawl all of them requires following the "next" button:</p>
<pre><code class="language-python">import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/"
start_url = "https://books.toscrape.com/catalogue/page-1.html"

all_books = []
url = start_url

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    for book in soup.select("article.product_pod"):
        title = book.select_one("h3 a")["title"]
        price = book.select_one("p.price_color").get_text(strip=True)
        rating_word = book.select_one("p.star-rating")["class"][1]
        rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        rating = rating_map.get(rating_word, 0)
        all_books.append({"title": title, "price": price, "rating": rating})

    # Check for a "next" button and follow it
    next_btn = soup.select_one("li.next a")
    url = BASE_URL + next_btn["href"] if next_btn else None

print(f"Scraped {len(all_books)} books total.")
</code></pre>
<p>Running this crawls all 1,000 books across all 50 pages.</p>
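<p>The crawler above joins <code>BASE_URL</code> and the link's <code>href</code> by hand, which works because it starts inside <code>/catalogue/</code>. A slightly more defensive sketch resolves the link relative to whichever page it was found on, using the standard library's <code>urljoin</code>. The helper name <code>next_page_url</code> and the sample <code>href</code> values are assumptions for illustration.</p>

```python
from urllib.parse import urljoin

def next_page_url(current_url, next_href):
    """Resolve the "next" link relative to the page it was found on."""
    return urljoin(current_url, next_href) if next_href else None

# Assumed href values, for illustration only
from_home = next_page_url("https://books.toscrape.com/", "catalogue/page-2.html")
from_catalogue = next_page_url("https://books.toscrape.com/catalogue/page-2.html", "page-3.html")
last_page = next_page_url("https://books.toscrape.com/catalogue/page-50.html", None)

print(from_home)
print(from_catalogue)
print(last_page)
```

<p>Because the link is resolved against the current page, the same helper keeps working whether the crawl starts at the home page or at a catalogue page.</p>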
<h3 id="heading-what-makes-this-approach-fragile">What Makes This Approach Fragile</h3>
<p>This scraper works well today because <code>books.toscrape.com</code> is a static, stable sandbox. In production, the same approach has a well-known weakness: it's completely dependent on the HTML structure staying the same.</p>
<p>If the site's developer renames <code>product_pod</code> to <code>book-card</code>, or moves the price into a <code>&lt;div&gt;</code> instead of a <code>&lt;p&gt;</code>, every selector breaks. You get no data, or worse, incorrect data with no error, and you only discover the breakage when someone notices the output looks wrong.</p>
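<p>One way to avoid discovering breakage late is to make the scraper fail loudly. The sketch below is one possible guard, not part of the original scraper: the function name <code>validate_results</code> and its thresholds are assumptions. It treats a rating of <code>0</code> as missing, because that is the fallback value the scraper assigns when the rating class is not recognized.</p>

```python
def validate_results(results, required_fields=("title", "price", "rating"), min_items=1):
    """Raise ValueError instead of silently accepting empty or partial output."""
    if len(results) < min_items:
        raise ValueError(f"expected at least {min_items} items, got {len(results)}")
    for index, item in enumerate(results):
        # None, an empty string, or the fallback rating 0 all count as missing
        missing = [f for f in required_fields if item.get(f) in (None, "", 0)]
        if missing:
            raise ValueError(f"item {index} is missing fields: {missing}")
    return results

good = [{"title": "A Light in the Attic", "price": "£51.77", "rating": 3}]
validate_results(good)

# An empty result set is what a renamed CSS class typically produces
try:
    validate_results([])
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

<p>Calling a check like this after every run turns a silent selector failure into an error you can alert on.</p>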
<p>This is one of the problems AI scraping is designed to address.</p>
<h2 id="heading-what-is-ai-web-scraping">What is AI Web Scraping?</h2>
<p>Traditional scraping works by following the structure of a page. It looks for specific elements, class names, or patterns in the HTML and extracts data based on those rules.</p>
<p>AI-powered scraping approaches the same problem differently. Instead of relying only on structure, it focuses on understanding the content itself. It looks at a page and identifies what something represents, not just where it's located.</p>
<p>In a traditional scraper, you might write something like:</p>
<pre><code class="language-python">response.css(".product-card .price::text").get()
</code></pre>
<p>You're telling the system exactly where to look. With AI scraping, you describe the outcome instead:</p>
<pre><code class="language-plaintext">Extract the product name, price, and availability for each item on this page.
</code></pre>
<p>The system reads the page, identifies what appears to be a product listing, extracts the relevant fields, and returns structured data.</p>
<h3 id="heading-whats-actually-happening-under-the-hood">What's Actually Happening Under the Hood</h3>
<p>AI scraping can feel like magic at first, but it's built on a combination of familiar components.</p>
<p>At the core are <strong>large language models (LLMs)</strong> trained on vast amounts of text, including web content and HTML. Over time, they learn patterns such as what a product listing looks like, how prices are usually presented, or how job listings are structured.</p>
<p>When given a page, the model can recognize these patterns and map them to the fields you asked for.</p>
<p>But the model is only one part of the system. You still need something to load and interact with the page. That is where <strong>browser automation</strong> comes in. Most AI scraping tools rely on headless browsers like Chromium or frameworks like Playwright to render pages, execute JavaScript, and handle real-world behavior such as scrolling or clicking.</p>
<p>On top of that, there's a layer that interprets your input. When you write a <strong>prompt</strong> describing the data you want, the system translates that into an extraction task. It decides what parts of the page are relevant and how to structure the output.</p>
<p>Finally, the system formats the results into clean data, typically as JSON or CSV, so you can use them directly with minimal post-processing.</p>
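<p>The stages described above can be sketched in a few lines. This is only an illustration of the flow, not how any particular tool is implemented: <code>fake_model</code> stands in for a real LLM call, the page text is hard-coded instead of coming from a headless browser, and all function names are hypothetical.</p>

```python
import json

def build_extraction_prompt(user_prompt, page_text):
    """Combine the user's request with the rendered page text."""
    return f"{user_prompt}\n\nReturn a JSON array only.\n\nPAGE:\n{page_text}"

def fake_model(prompt):
    """Stand-in for an LLM call; returns a canned answer so the sketch runs offline."""
    return '[{"name": "Kettle", "price": "19.99", "availability": "In stock"}]'

def extract(user_prompt, page_text, model=fake_model):
    """Run the prompt through the model and parse the reply into records."""
    raw = model(build_extraction_prompt(user_prompt, page_text))
    records = json.loads(raw)  # fails loudly if the model did not return valid JSON
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records

# In a real tool this text would come from a rendered page
page_text = "Kettle 19.99 In stock"
records = extract(
    "Extract the product name, price, and availability for each item on this page.",
    page_text,
)
print(records)
```

<p>The parsing step matters as much as the model call: treating the reply as untrusted text and parsing it strictly is what turns a free-form answer into data you can rely on.</p>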
<p><strong>Note:</strong> Tools like ChatGPT can interpret content, but they're not scraping systems. They don't crawl pages, handle workflows, or run repeatable data extraction. AI scraping tools combine this intelligence with the infrastructure required to collect data reliably.</p>
<h3 id="heading-popular-tools-behind-ai-scraping">Popular Tools Behind AI Scraping</h3>
<p>As AI scraping has grown more popular, a number of tools have emerged that make this approach accessible without requiring you to build everything from scratch.</p>
<p>For example:</p>
<ul>
<li><p><a href="https://spidra.io/"><strong>Spidra</strong></a> takes a pretty direct approach to extraction. You describe the data you want, and it handles loading the page, interpreting the content, and returning structured results. It also manages things like navigation and interactions behind the scenes, which makes it useful when you want to extract data without worrying about selectors or maintaining scraping logic.</p>
</li>
<li><p><a href="https://www.firecrawl.dev/"><strong>Firecrawl</strong></a> focuses on turning web pages into clean, structured content. Instead of extracting specific fields like price or title, it converts entire pages into formats like markdown or simplified JSON. This makes it especially useful when you want to feed web content into AI systems or work with it in a readable format without dealing with messy HTML.</p>
</li>
<li><p><a href="https://jina.ai/reader/"><strong>Jina Reader</strong></a> is designed to simplify web pages into clean text. It strips away layout noise such as navigation, ads, and styling, and focuses on the actual content. This is helpful when your goal is to understand or process the information on a page rather than extract structured fields.</p>
</li>
<li><p><a href="https://brightdata.com/products/web-scraper/ai"><strong>Bright Data AI scrapers</strong></a> combine AI-based extraction with a strong scraping infrastructure. They allow you to request structured data without writing selectors, while also handling challenges like blocking and scaling. This makes them more suitable for larger or more demanding scraping tasks.</p>
</li>
<li><p><a href="https://apify.com/"><strong>Apify</strong></a> sits somewhere in between traditional and AI-driven scraping. It provides a full platform for building and running scrapers, and allows you to introduce AI where it makes sense, whether for extraction or post-processing. This makes it useful when you need more control over the entire pipeline.</p>
</li>
</ul>
<p>In practice, these tools aren't trying to solve the exact same problem. Some focus on extracting structured data, others on cleaning content, and others on building full scraping workflows. The right choice depends on what you're trying to achieve, not just the tool itself.</p>
<h2 id="heading-ai-scraping-in-practice">AI Scraping in Practice</h2>
<p>Let's run the same data collection task of extracting books from <code>books.toscrape.com</code> using an AI scraping tool. We'll use Spidra's API so you can see exactly what changes.</p>
<h3 id="heading-step-1-get-an-api-key">Step 1: Get an API Key</h3>
<p>Sign up at <a href="http://spidra.io">spidra.io</a> and create an API key from your dashboard. You'll use this key to authenticate every request.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e33f66ffc003cfb0c64069e/b4dd4103-d9ee-4193-b8df-b62cd815bb16.png" alt="Getting Spidra API key" style="display:block;margin:0 auto" width="3024" height="1656" loading="lazy">

<h3 id="heading-step-2-understand-the-api-structure">Step 2: Understand the API Structure</h3>
<p>Spidra's scrape endpoint accepts a JSON payload. The two most important fields are <code>urls</code> (the pages to scrape) and <code>prompt</code> (what to extract, written in plain English). You can optionally specify the <code>output</code> format; JSON works best for structured data.</p>
<pre><code class="language-bash">POST https://api.spidra.io/scrape
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json
</code></pre>
<p>Notice that there are no selectors and no HTML inspection, just a URL and a description.</p>
<h3 id="heading-step-3-write-a-single-page-extraction">Step 3: Write a Single-Page Extraction</h3>
<p>Here's the equivalent of our traditional scraper, written as an API call:</p>
<pre><code class="language-python">import requests
import json

API_KEY = "your_api_key_here"

payload = {
    "urls": [{"url": "https://books.toscrape.com/"}],
    "prompt": "Extract all books on this page. For each book, return the title, price, and star rating as a number from 1 to 5.",
    "output": "json"
}

response = requests.post(
    "https://api.spidra.io/scrape",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json=payload
)

data = response.json()
print(json.dumps(data, indent=2))
</code></pre>
<p>That's the entire scraper. No <code>BeautifulSoup</code>, no selector logic, and no HTML parsing.</p>
<h3 id="heading-step-4-understand-the-output">Step 4: Understand the Output</h3>
<p>The API returns a structured JSON response. Each book is represented as an object with the fields you described:</p>
<pre><code class="language-json">{
  "results": [
    {
      "title": "A Light in the Attic",
      "price": "£51.77",
      "rating": 3
    },
    {
      "title": "Tipping the Velvet",
      "price": "£53.74",
      "rating": 1
    },
    {
      "title": "Soumission",
      "price": "£50.10",
      "rating": 1
    }
    ...
  ]
}
</code></pre>
<p>The model identified the star rating encoding (<code>star-rating Three</code> → 3) without being told how ratings are represented. It understood the intent of "star rating as a number from 1 to 5" and handled the mapping itself.</p>
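<p>Since the values come from a model rather than from fixed selectors, it is worth checking them before using them. The sketch below assumes the response has the shape shown above (a <code>results</code> list with <code>title</code>, <code>price</code>, and <code>rating</code>) and uses a hard-coded sample instead of a live API call. The <code>normalize</code> helper is hypothetical.</p>

```python
import json
from decimal import Decimal

# Sample response in the shape shown above, truncated to two books
raw = """
{"results": [
  {"title": "A Light in the Attic", "price": "£51.77", "rating": 3},
  {"title": "Tipping the Velvet", "price": "£53.74", "rating": 1}
]}
"""

def normalize(book):
    """Convert the price string to a Decimal and check the rating range."""
    price = Decimal(book["price"].lstrip("£"))
    rating = int(book["rating"])
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    return {"title": book["title"], "price_gbp": price, "rating": rating}

books = [normalize(b) for b in json.loads(raw)["results"]]
print(books[0])
```

<p>A check like this catches the cases where the model returns a rating as text or drops the currency format you expected.</p>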
<h3 id="heading-step-5-use-actions-for-multi-step-workflows">Step 5: Use Actions for Multi-Step Workflows</h3>
<p>Where AI scraping starts to show its real advantages is with workflows that would require significant engineering in a traditional scraper.</p>
<p>Suppose you want to visit each book's detail page and extract the full description and availability status (not just what's visible on the listing page).</p>
<p>In a traditional scraper, this means building a follow-link loop, managing state, handling errors on each detail page, and maintaining separate selectors for the detail page's different structure. In an AI scraper like Spidra, you can mimic a real human interaction with <a href="https://docs.spidra.io/features/actions">browser actions</a>:</p>
<pre><code class="language-json">{
  "urls": [{
    "url": "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    "actions": [{
      "type":            "forEach",
      "observe":         "Find all book cards in the product grid",
      "mode":            "inline",
      "captureSelector": "article.product_pod",
      "maxItems":        10,
      "itemPrompt":      "Extract the book title, price, and star rating (One/Two/Three/Four/Five). Return as JSON: {title, price, star_rating}"
    }]
  }]
}
</code></pre>
<p>The system navigates to each book's page, reads the new content, extracts the additional fields, and returns them as part of the same result set.</p>
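<p>You can also control the shape of the returned data by passing a <code>schema</code> that lists the fields you expect, their types, and which ones are required:</p>
<pre><code class="language-json">{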
  "urls": [{ "url": "https://jobs.example.com/senior-engineer" }],
  "prompt": "Extract the job details",
  "schema": {
    "type": "object",
    "required": ["title", "company", "remote", "employment_type"],
    "properties": {
      "title":           { "type": "string" },
      "company":         { "type": "string" },
      "location":        { "type": ["string", "null"] },
      "remote":          { "type": ["boolean", "null"] },
      "salary_min":      { "type": ["number", "null"] },
      "salary_max":      { "type": ["number", "null"] },
      "employment_type": {
        "type": ["string", "null"],
        "enum": ["full_time", "part_time", "contract", null]
      },
      "skills": {
        "type": "array",
        "items": { "type": "string" }
      }
    }
  }
}
</code></pre>
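<p>Even with a schema in the request, you may want to check records on your side before storing them. The sketch below hand-rolls a few of the rules from the schema above (required fields, the <code>employment_type</code> values, and <code>skills</code> being a list of strings). The <code>check_job</code> helper and the sample records are made up for illustration, and a full JSON Schema validator would cover more cases.</p>

```python
REQUIRED = ["title", "company", "remote", "employment_type"]
EMPLOYMENT_TYPES = {"full_time", "part_time", "contract", None}

def check_job(record):
    """Return a list of problems; an empty list means the record passed these checks."""
    problems = [f"missing required field: {key}" for key in REQUIRED if key not in record]
    if record.get("employment_type") not in EMPLOYMENT_TYPES:
        problems.append(f"unexpected employment_type: {record.get('employment_type')!r}")
    if not all(isinstance(skill, str) for skill in record.get("skills", [])):
        problems.append("skills must be a list of strings")
    return problems

# Hypothetical records for illustration
ok = {
    "title": "Senior Engineer",
    "company": "Example Co",
    "remote": True,
    "employment_type": "full_time",
    "skills": ["Python"],
}
bad = {"title": "Senior Engineer", "employment_type": "freelance"}

print(check_job(ok))
print(check_job(bad))
```

<p>Returning a list of problems rather than raising on the first one makes it easier to log everything that is wrong with a record in a single pass.</p>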
<p>These tools offer more than what is shown here, including <a href="https://docs.spidra.io/features/batch-scraping">batch scraping</a> and AI crawling.</p>
<h3 id="heading-where-ai-scraping-earns-its-keep">Where AI Scraping Earns Its Keep</h3>
<p>Now suppose the site updates its frontend. The class <code>product_pod</code> gets renamed to <code>book-card</code>. The price moves into a different element.</p>
<p>In the traditional scraper, you get zero results and no error until you notice the data is missing. You then re-inspect the page, update the selectors, test, and redeploy.</p>
<p>In the AI scraper, you run the same prompt. The model isn't looking for <code>product_pod</code> or <code>price_color</code>. It's looking for content that resembles a product listing with pricing information. The layout change is invisible to the extraction logic.</p>
<p>This is the core operational advantage of the AI approach: <strong>structural changes to a page don't automatically break your extraction.</strong></p>
<h2 id="heading-traditional-scraping-vs-ai-scraping-when-to-use-each">Traditional Scraping vs AI Scraping: When to Use Each</h2>
<p>At this point, the difference between the two approaches is clear. The more important question is when each one actually makes sense in practice.</p>
<p>A simple way to think about it is this:</p>
<table>
<thead>
<tr>
<th><strong>Scenario</strong></th>
<th><strong>Traditional Scraping</strong></th>
<th><strong>AI Scraping</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Stable websites</td>
<td>✅ Best choice</td>
<td>✅ Works, but can be overkill</td>
</tr>
<tr>
<td>Frequently changing layouts</td>
<td>❌ Breaks often</td>
<td>✅ More resilient</td>
</tr>
<tr>
<td>Large-scale crawling</td>
<td>✅ More cost-efficient</td>
<td>✅ Efficient but can get expensive</td>
</tr>
<tr>
<td>Fast prototyping</td>
<td>❌ Slower setup</td>
<td>✅ Very fast</td>
</tr>
<tr>
<td>Non-technical users</td>
<td>❌ Requires coding</td>
<td>✅ More accessible</td>
</tr>
<tr>
<td>Full control &amp; transparency</td>
<td>✅ High control</td>
<td>❌ Less transparent</td>
</tr>
<tr>
<td>Messy or inconsistent data</td>
<td>❌ Hard to maintain</td>
<td>✅ Easier to handle</td>
</tr>
<tr>
<td>Complex workflows (login, steps)</td>
<td>⚠️ Possible but manual</td>
<td>✅ Often built-in</td>
</tr>
</tbody></table>
<p>In practice, it's not a cut-and-dried choice between the two. Traditional scraping works best when everything is predictable and stable. AI scraping becomes useful when things are messy, dynamic, or time-sensitive. Most real-world systems combine both approaches rather than relying on one alone.</p>
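<p>One common way to combine the two is to try selectors first and fall back to an AI extractor only when they come back empty. The sketch below shows that pattern with stub extractors so it runs offline. The function names are hypothetical, and in a real system the two callables would wrap your selector code and your AI scraping API call.</p>

```python
def extract_with_fallback(html, selector_extractor, ai_extractor, min_items=1):
    """Try the cheap selector path first; use the AI path only when it comes back short."""
    try:
        items = selector_extractor(html)
    except (KeyError, TypeError, IndexError):
        items = []  # a selector matched nothing or an attribute was missing
    if len(items) >= min_items:
        return items, "selectors"
    return ai_extractor(html), "ai"

# Stub extractors so the sketch runs offline
def broken_selectors(html):
    return []  # simulates a renamed CSS class

def stub_ai(html):
    return [{"title": "A Light in the Attic", "price": "£51.77"}]

items, source = extract_with_fallback("(page html)", broken_selectors, stub_ai)
print(source, items)
```

<p>This keeps the cost-efficient path for the common case and spends on the AI path only when the page structure has shifted.</p>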
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Web scraping is not going away. What's changing is how we approach it.</p>
<p>Traditional scraping gives you control and precision, but it can be fragile and time-consuming to maintain. AI scraping makes things faster and more flexible, especially when dealing with messy or constantly changing pages, but it comes with less transparency.</p>
<p>In practice, most real-world workflows are starting to combine both.</p>
<p>We're also beginning to see AI scraping tools integrate into larger systems, especially with AI agents and MCP-style setups, where scraping becomes something that can be triggered on demand rather than built from scratch each time.</p>
<p>The key takeaway is simple. Traditional scraping tells the system where the data is. AI scraping tells the system what the data means.</p>
<p>Knowing when to use each is what actually matters.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Turn Websites into LLM-Ready Data Using Firecrawl ]]>
                </title>
                <description>
                    <![CDATA[ If you’ve ever tried feeding web pages into an AI model, you know the pain. Websites come with ads, navigation bars, and messy HTML. Before your Large Language Model (LLM) can understand the content, you must clean and format it. That’s where Firecra... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-turn-websites-into-llm-ready-data-using-firecrawl/</link>
                <guid isPermaLink="false">68f9002b255bb54c291b6f88</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                    <category>
                        <![CDATA[ APIs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Wed, 22 Oct 2025 16:02:51 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761148818578/a9572dc3-cc79-4ba9-ab47-4270e465df70.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’ve ever tried feeding web pages into an AI model, you know the pain.</p>
<p>Websites come with ads, navigation bars, and messy HTML. Before your Large Language Model (LLM) can understand the content, you must clean and format it.</p>
<p>That’s where <a target="_blank" href="https://github.com/firecrawl/firecrawl">Firecrawl</a> makes life easy. It’s an open-source API tool that turns any website into neat, structured data ready for LLMs in seconds.</p>
<p>In this tutorial, we’ll look at two ways of using Firecrawl. One is through Firecrawl’s API (a paid API with a free tier) and the other is a self-hosted version.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-firecrawl">What Is Firecrawl?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-llms-need-clean-data">Why LLMs Need Clean Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-firecrawl">Setting Up Firecrawl</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scraping-a-single-page">Scraping a Single Page</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-crawling-an-entire-website">Crawling an Entire Website</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-extracting-structured-data-with-ai">Extracting Structured Data with AI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-self-hosting-firecrawl-using-sevalla">Self-hosting Firecrawl using Sevalla</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-use-cases">Use Cases</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-firecrawl">What Is Firecrawl?</h2>
<p><a target="_blank" href="https://www.firecrawl.dev/">Firecrawl</a> is a web crawling and scraping service that helps developers collect clean data from websites. You give it a URL, and it returns the content in formats like Markdown, HTML, JSON, or even screenshots.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760534000207/93e57884-c611-40cc-b7be-4fe7d3c1ac5c.png" alt="Firecrawl illustrated - open source and cloud version" class="image--center mx-auto" width="1000" height="305" loading="lazy"></p>
<p>Unlike basic scrapers, Firecrawl can handle complex websites that load content with JavaScript. It can crawl through links, follow pages, and handle the heavy lifting like proxies and anti-bot systems automatically.</p>
<p>In short, it does the hard part of web data collection, so you can focus on using that data for your AI or automation projects.</p>
<h2 id="heading-why-llms-need-clean-data">Why LLMs Need Clean Data</h2>
<p>LLMs learn and respond based on the text you give them. If that text includes clutter like HTML tags, scripts, or irrelevant sections, the AI gets confused.</p>
<p>Clean, well-structured data helps the model stay focused on the real content, like the article body, product details, or documentation.</p>
<p>Firecrawl makes this process simple. Instead of spending hours building scrapers or cleaning text, you can get ready-to-use content in a single API call.</p>
<h2 id="heading-setting-up-firecrawl">Setting Up Firecrawl</h2>
<p>To get started, create an account on <a target="_blank" href="https://firecrawl.dev/">firecrawl.dev</a> and grab your API key. Running Firecrawl on your own machine involves setting up a server, a Redis cache, and other components, so we’ll start with the hosted API from firecrawl.dev.</p>
<p>We can also quickly test its capabilities in the UI of the website.</p>
<p>Let’s use <a target="_blank" href="https://freecodecamp.org/">https://freecodecamp.org</a> as the domain to see if Firecrawl can return some results.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760534104974/eb7ebdb4-d91f-4c49-92d9-b1c56c86ebb1.png" alt="Crawling freeCodeCamp" class="image--center mx-auto" width="1000" height="412" loading="lazy"></p>
<p>And yes, we can see several URLs scraped by Firecrawl.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760534508920/5f6423e6-aef2-4935-a821-0246fd96c12e.png" alt="Firecrawl results" class="image--center mx-auto" width="1000" height="524" loading="lazy"></p>
<p>Now let’s access Firecrawl using code. The free plan lets you scrape 500 pages, so it’s all we need to understand how it works.</p>
<p>You can use either the <a target="_blank" href="https://docs.firecrawl.dev/sdks/python">Python SDK</a>, the <a target="_blank" href="https://docs.firecrawl.dev/sdks/node">Node.js SDK</a>, or direct API requests with curl.</p>
<p>Here’s how you install the SDKs:</p>
<p>Python:</p>
<pre><code class="lang-plaintext">pip install firecrawl-py
</code></pre>
<p>Node.js:</p>
<pre><code class="lang-plaintext">npm install @mendable/firecrawl-js
</code></pre>
<p>Once installed, you just need to set your API key and you’re ready to crawl.</p>
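<p>One way to set the key without hard-coding it is to read it from an environment variable. The sketch below is only an illustration: the helper <code>load_api_key</code> and the variable name <code>FIRECRAWL_API_KEY</code> are assumptions made for this example, not something the steps above require.</p>
<pre><code class="lang-python">import os


def load_api_key(env_var="FIRECRAWL_API_KEY"):
    """Return the API key stored in the given environment variable."""
    # The variable name is an assumption for this sketch; use any name you like.
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key
</code></pre>
<p>You could then pass the result wherever the key is needed, for example <code>Firecrawl(api_key=load_api_key())</code>.</p>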
<h2 id="heading-scraping-a-single-page">Scraping a Single Page</h2>
<p>Let’s say you want to extract the main content from Firecrawl’s homepage. You can do this in just a few lines.</p>
<p><strong>Python Example:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> firecrawl <span class="hljs-keyword">import</span> Firecrawl

<span class="hljs-comment"># Create a client with your API key</span>
firecrawl = Firecrawl(api_key=<span class="hljs-string">"fc-YOUR_API_KEY"</span>)

<span class="hljs-comment"># Scrape one page and request both Markdown and HTML output</span>
doc = firecrawl.scrape(
    <span class="hljs-string">"https://firecrawl.dev"</span>,
    formats=[<span class="hljs-string">"markdown"</span>, <span class="hljs-string">"html"</span>]
)

<span class="hljs-comment"># Print the cleaned Markdown version of the page</span>
print(doc.markdown)
</code></pre>
<p>This script prints the cleaned version of the page in Markdown format, perfect for an LLM to read or analyze.</p>
<p>With this one call, you get the core text, free from HTML clutter.</p>
<h2 id="heading-crawling-an-entire-website">Crawling an Entire Website</h2>
<p>If you need data from multiple pages like a full documentation site, you can crawl the entire domain. Firecrawl finds all the links and scrapes them automatically.</p>
<p>Example API call:</p>
<pre><code class="lang-plaintext">curl -X POST https://api.firecrawl.dev/v2/crawl \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "limit": 10,
    "scrapeOptions": {
      "formats": ["markdown", "html"]
    }
  }'
</code></pre>
<p>This starts a crawl job and returns a job ID. Once the job finishes, you can retrieve all the scraped pages in clean, LLM-ready formats.</p>
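<p>Because the crawl runs as a background job, your code has to check on it until it finishes. Here is a minimal sketch of such a polling loop. It assumes you supply a hypothetical <code>fetch_status</code> function that looks up the job by its ID and returns a dictionary with a <code>status</code> field. The status values <code>completed</code> and <code>failed</code> are also assumptions, so check the Firecrawl API reference for the exact response shape.</p>
<pre><code class="lang-python">import time


def wait_for_job(fetch_status, job_id, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status(job_id) repeatedly until the job reaches a final status."""
    for _ in range(max_attempts):
        result = fetch_status(job_id)
        # "completed" and "failed" are assumed status names for this sketch.
        if result.get("status") in ("completed", "failed"):
            return result
        sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish after {max_attempts} checks")


# Stand-in for a real status lookup, so the example runs without network access.
fake_responses = iter([{"status": "scraping"}, {"status": "completed", "data": []}])
result = wait_for_job(lambda job_id: next(fake_responses), "example-job-id", interval=0)
print(result["status"])
</code></pre>
<p>Capping the number of attempts keeps a stuck job from blocking your script forever.</p>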
<h2 id="heading-extracting-structured-data-with-ai">Extracting Structured Data with AI</h2>
<p>One of Firecrawl’s best features is AI-powered extraction. You can ask Firecrawl to read a page and return structured data, like a product’s price, description, or reviews, in JSON format.</p>
<p>Example:</p>
<pre><code class="lang-plaintext">curl -X POST https://api.firecrawl.dev/v2/extract \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": ["https://firecrawl.dev/*"],
    "prompt": "Extract the company mission and whether it is open source.",
    "schema": {
      "type": "object",
      "properties": {
        "company_mission": { "type": "string" },
        "is_open_source": { "type": "boolean" }
      }
    }
  }'
</code></pre>
<p>Firecrawl uses a built-in LLM to read the content and fill in the structure automatically. You can even skip the schema and just provide a natural-language prompt, like:</p>
<blockquote>
<p><em>“Extract all the pricing details and feature names from this page.”</em></p>
</blockquote>
<p>This is ideal for AI pipelines, RAG (Retrieval-Augmented Generation) systems, or dashboards that rely on clean, structured data.</p>
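<p>If you would rather build this request in Python than in curl, the sketch below assembles the same body using only the standard library. The helper name <code>build_extract_request</code> and its parameters are made up for this example. It only constructs the request: nothing is sent until you pass it to <code>urllib.request.urlopen</code>.</p>
<pre><code class="lang-python">import json
import urllib.request


def build_extract_request(api_key, urls, prompt, schema=None,
                          base_url="https://api.firecrawl.dev"):
    """Build (but do not send) a POST request for the /v2/extract endpoint."""
    body = {"urls": urls, "prompt": prompt}
    # The schema is optional: leave it out to rely on the prompt alone.
    if schema is not None:
        body["schema"] = schema
    return urllib.request.Request(
        f"{base_url}/v2/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_extract_request(
    "fc-YOUR_API_KEY",
    ["https://firecrawl.dev/*"],
    "Extract all the pricing details and feature names from this page.",
)
print(request.full_url)
print(json.loads(request.data))
</code></pre>
<p>Keeping <code>base_url</code> as a parameter means the same helper can point at a different host later without other changes.</p>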
<h2 id="heading-self-hosting-firecrawl-using-sevalla">Self-hosting Firecrawl using Sevalla</h2>
<p>Firecrawl is open source, which means you don’t have to pay for the API if you prefer full control. You can deploy it on your own server and customise it however you like.</p>
<p>You can install Firecrawl on your local machine by setting up a database, cache, and other required components. But a local setup is only reachable from your own machine, so it isn’t suitable for applications you plan to deploy.</p>
<p>To host Firecrawl, you can use any cloud provider, such as <a target="_blank" href="https://aws.amazon.com/">AWS</a> or <a target="_blank" href="https://www.heroku.com/">Heroku</a>. I will be using Sevalla.</p>
<p><a target="_blank" href="https://sevalla.com/">Sevalla</a> is a modern, usage-based Platform-as-a-service provider. It offers application hosting, database, object storage, and static site hosting for your projects.</p>
<p>I am using Sevalla for hosting for two reasons:</p>
<ul>
<li><p>Every platform will charge you for creating a cloud resource. Sevalla comes with a $50 credit for us to use, so we won't incur any costs for this example.</p>
</li>
<li><p>Sevalla has a <a target="_blank" href="https://docs.sevalla.com/templates/overview">template for Firecrawl</a>, so it simplifies the manual installation and setup for each resource you will need for Firecrawl.</p>
</li>
</ul>
<p><a target="_blank" href="https://app.sevalla.com/login">Log in</a> to Sevalla and click on Templates. You can see Firecrawl as one of the templates.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760534530797/6d327148-5c6f-40cc-863a-2ad9d82763bd.png" alt="Sevalla Templates" class="image--center mx-auto" width="1000" height="278" loading="lazy"></p>
<p>Click “Deploy now” and choose a server in the pop-up, and click “Deploy”. Sevalla will start provisioning the resources we need for running our Firecrawl instance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760534550817/38525217-89cb-41b3-8128-85164734b764.png" alt="Firecrawl resources" class="image--center mx-auto" width="1000" height="687" loading="lazy"></p>
<p>Once the deployment is complete, you will see three instances provisioned:</p>
<ul>
<li><p>a <a target="_blank" href="https://redis.io/">Redis cache</a></p>
</li>
<li><p>a server to run <a target="_blank" href="https://playwright.dev/">Playwright</a></p>
</li>
<li><p>the API application</p>
</li>
</ul>
<p>Go to the Firecrawl-API application. Under the deployments section, click on “Visit app” once the deployment is complete.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760534571061/3f4db002-c775-445f-a1fc-af1aefff2d86.png" alt="Firecrawl Deployment" class="image--center mx-auto" width="1000" height="427" loading="lazy"></p>
<p>You can now use your private endpoint in your applications. My API URL is <a target="_blank" href="https://firecrawl-api-56t8x.sevalla.app/">https://firecrawl-api-56t8x.sevalla.app</a> (this is a temporary URL, so don’t use it), so I can replace api.firecrawl.dev with this URL.</p>
<pre><code class="lang-plaintext">curl -X POST https://firecrawl-api-56t8x.sevalla.app/v2/extract \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": ["https://firecrawl.dev/*"],
    "prompt": "Extract the company mission and whether it is open source.",
    "schema": {
      "type": "object",
      "properties": {
        "company_mission": { "type": "string" },
        "is_open_source": { "type": "boolean" }
      }
    }
  }'
</code></pre>
<p>If you want to run the project locally by installing applications like Redis, PostgreSQL, and Playwright, <a target="_blank" href="https://github.com/firecrawl/firecrawl/blob/main/CONTRIBUTING.md">here’s a detailed guide</a>.</p>
<h2 id="heading-use-cases">Use Cases</h2>
<p>Developers and data scientists use Firecrawl for a wide range of tasks. They often rely on it to turn documentation sites into training data for large language models, ensuring that their models can learn from accurate and well-organised sources.</p>
<p>Others use it to collect blog posts or news articles for <a target="_blank" href="https://www.turingtalks.ai/p/how-to-build-a-simple-sentiment-analyzer-using-hugging-face-transformer">sentiment analysis</a>, helping them understand trends, opinions, or public reactions across the web.</p>
<p>Firecrawl is also valuable for monitoring web content changes, which is essential for research projects or compliance tracking where up-to-date information is critical.</p>
<p>Teams can also use it to build “chat with your website” AI assistants that can answer questions based on the latest site content.</p>
<p>In each of these cases, Firecrawl ensures that your model receives clean, structured, and consistent data, making it easier to build reliable and intelligent AI systems.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Turning messy websites into readable text used to be one of the toughest parts of building AI systems. Firecrawl changes that. With one API call, you can scrape, crawl, and extract high-quality data that your LLM can immediately understand.</p>
<p>If you’re building anything related to AI, RAG, or data pipelines, Firecrawl is one of those tools you’ll wish you had discovered earlier.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Python to Build Your Own Web Scraper ]]>
                </title>
                <description>
                    <![CDATA[ By Jess Wilk What is Web scraping? Web scraping is a technique used to collect large amounts of data automatically using a programming script. This makes it useful for many professionals such as data analysts, market researchers, SEO specialists, bus... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/use-python-sdk-to-build-a-web-scraper/</link>
                <guid isPermaLink="false">66d45f6d38f2dc3808b790c5</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 10 Jul 2024 13:11:06 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/ilya-pavlov-OqtafYT5kTw-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Jess Wilk</p>
<h2 id="heading-what-is-web-scraping"><strong>What is Web Scraping?</strong></h2>
<p>Web scraping is a technique used to collect large amounts of data automatically using a programming script. This makes it useful for many professionals such as data analysts, market researchers, SEO specialists, business analysts, and academic researchers.</p>
<h2 id="heading-what-youll-learn-here"><strong>What You'll Learn Here</strong></h2>
<p>Two third-party Python libraries, Requests and Beautiful Soup, help you scrape websites more easily. Used together, Requests retrieves the HTML content from a website and Beautiful Soup parses it so you can extract the data you need. In this article, I'll show you how to use these libraries with an example.</p>
<p>By the end of this guide, you will be equipped to build your own web scraper, and you will have a better understanding of how to collect a large amount of data and use it to make data-driven decisions.</p>
<p>Please note that while a web scraper is a useful tool, make sure you're compliant with all legal guidelines. This involves respecting the website's <code>robots.txt</code> file and adhering to the terms of service so you avoid unauthorized data extraction. </p>
<p>Also, before scraping, make sure that the scraping process does not harm the website's functionality or overload its servers. Finally, respect data privacy by not scraping personal or sensitive information without proper consent.</p>
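<p>To make the <code>robots.txt</code> check concrete, here is a small sketch using Python's built-in <code>urllib.robotparser</code> module. The rules in <code>SAMPLE_ROBOTS_TXT</code>, the <code>my-scraper</code> user agent, and the <code>example.com</code> URLs are all made up so the example runs offline. A real site's file will differ.</p>
<pre><code class="lang-python">from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/datasets"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
</code></pre>
<p>Against a live site, you would instead point the parser at the real file with <code>set_url()</code> and load it with <code>read()</code> before calling <code>can_fetch()</code>.</p>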
<h2 id="heading-how-beautiful-soup-and-python-requests-work-together"><strong>How Beautiful Soup and Python Requests Work Together</strong></h2>
<p>Let’s understand the role of each library. </p>
<p>The Python Requests library is responsible for fetching HTML content from the URL you provide in the script. Once it retrieves the content, it stores the data in a response object. </p>
<p>Beautiful Soup then takes over, transforming the raw HTML from the Requests response into a structured format and parsing it. You can then scrape data from the parsed HTML by specifying attributes, allowing you to automate the collection of specific data from websites or repositories.</p>
<p>But this duo has its limitations. The Requests library only downloads the HTML that the server returns and doesn't run JavaScript, so it can’t handle content that is loaded dynamically. You should use it primarily for sites that serve static content. If you need to scrape a dynamically loaded site, you will have to use more advanced automation tools like Selenium.</p>
<h2 id="heading-how-to-build-a-web-scraper-with-python"><strong>How to Build a Web Scraper with Python</strong></h2>
<p>Now that we understand what Beautiful Soup and Python Requests can do, let’s discuss how we can scrape data using these tools.</p>
<p>In the following example, we’ll be scraping data from the <a target="_blank" href="https://archive.ics.uci.edu/datasets">UC Irvine Machine Learning Repository</a>. </p>
<p><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXd2MTmii-KD8tu6AAeHhbr9Sb5vauq3jC3AcYc2Yvd4kcCQLdTdVrBqZuFOpF-vKQ3E012hV7W6bm0iOtqrCsvJx6xsT165mKqbKVC8Kf48ZxOMq-Joi7n2jDw6fl3AM4XLVBuikCJpXTIB6c6JriJtP9MQ?key=f_hrU3B_rjNJFpKZiiV3Pw" alt="Image" width="600" height="400" loading="lazy">
<em>Datasets at the UC Irvine Machine Learning Repository</em></p>
<p>As you can see, it contains many datasets, and you can find further details about each dataset by going to a dedicated page for the dataset. You can access the dedicated page by clicking on the dataset name in the list above. </p>
<p>Check out the image below to get an idea of the information provided for each dataset.</p>
<p><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcb7_BVgpIh1P931U-HHX6BKIPN1ODKRzc6WqjX-n77uA9Uvz_e80wqc2YtJx2-Rq3HzWKtlDE31gV-7jz0UASzKrhq86X45paNDkVVO5oNXeaRZ99vIs45g1TwMk54hpyEetzyuDjMgPYW4KKW-oPhKjh8?key=f_hrU3B_rjNJFpKZiiV3Pw" alt="Image" width="600" height="400" loading="lazy">
<em>Iris dataset</em></p>
<p>The code we write below will go through each dataset, scrape the details, and save them to a CSV file.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>To try out this tutorial, you need several prerequisites set up.</p>
<p>I am assuming you already have a Python installation on your machine. If not, please download the latest Python from the <a target="_blank" href="https://www.python.org/downloads/">official website</a>.</p>
<p>The Requests and Beautiful Soup libraries don't come with Python. You will have to install them separately. For this, you can use the pip package manager, which has been included with Python installations by default since Python 3.4.</p>
<p>You can use pip to install the Requests and Beautiful Soup libraries using the following commands:</p>
<pre><code class="lang-python">pip install requests
pip install beautifulsoup4
</code></pre>
<p>Once both libraries are installed, you are ready to start coding.</p>
<h3 id="heading-step-1-import-necessary-libraries">Step 1: Import Necessary Libraries</h3>
<p>First, import the necessary libraries: Requests for making HTTP requests, BeautifulSoup for parsing HTML content (both installed in the previous step), and Python's built-in <code>csv</code> module for saving the data.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">import</span> csv
</code></pre>
<h3 id="heading-step-2-define-the-base-url-and-csv-headers">Step 2: Define the Base URL and CSV Headers</h3>
<p>Set the base URL for the dataset listings and define the headers for the CSV file where the scraped data will be saved.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scrape_uci_datasets</span>():</span>
    base_url = <span class="hljs-string">"https://archive.ics.uci.edu/datasets"</span>


    headers = [
        <span class="hljs-string">"Dataset Name"</span>, <span class="hljs-string">"Donated Date"</span>, <span class="hljs-string">"Description"</span>,
        <span class="hljs-string">"Dataset Characteristics"</span>, <span class="hljs-string">"Subject Area"</span>, <span class="hljs-string">"Associated Tasks"</span>,
        <span class="hljs-string">"Feature Type"</span>, <span class="hljs-string">"Instances"</span>, <span class="hljs-string">"Features"</span>
    ]


    data = []
</code></pre>
<h3 id="heading-step-3-create-a-function-to-scrape-dataset-details">Step 3: Create a Function to Scrape Dataset Details</h3>
<p>Inside <code>scrape_uci_datasets</code>, define a nested function <code>scrape_dataset_details</code> that takes the URL of an individual dataset page, retrieves the HTML content, parses it using BeautifulSoup, and extracts relevant information.</p>
<pre><code class="lang-python">
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scrape_dataset_details</span>(<span class="hljs-params">dataset_url</span>):</span>
        response = requests.get(dataset_url)
        soup = BeautifulSoup(response.text, <span class="hljs-string">'html.parser'</span>)


        dataset_name = soup.find(
            <span class="hljs-string">'h1'</span>, class_=<span class="hljs-string">'text-3xl font-semibold text-primary-content'</span>)
        dataset_name = dataset_name.text.strip() <span class="hljs-keyword">if</span> dataset_name <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>


        donated_date = soup.find(<span class="hljs-string">'h2'</span>, class_=<span class="hljs-string">'text-sm text-primary-content'</span>)
        donated_date = donated_date.text.strip().replace(
            <span class="hljs-string">'Donated on '</span>, <span class="hljs-string">''</span>) <span class="hljs-keyword">if</span> donated_date <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>


        description = soup.find(<span class="hljs-string">'p'</span>, class_=<span class="hljs-string">'svelte-17wf9gp'</span>)
        description = description.text.strip() <span class="hljs-keyword">if</span> description <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>


        details = soup.find_all(<span class="hljs-string">'div'</span>, class_=<span class="hljs-string">'col-span-4'</span>)


        dataset_characteristics = details[<span class="hljs-number">0</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        subject_area = details[<span class="hljs-number">1</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        associated_tasks = details[<span class="hljs-number">2</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">2</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        feature_type = details[<span class="hljs-number">3</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">3</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        instances = details[<span class="hljs-number">4</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">4</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        features = details[<span class="hljs-number">5</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">5</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>


        <span class="hljs-keyword">return</span> [
            dataset_name, donated_date, description, dataset_characteristics,
            subject_area, associated_tasks, feature_type, instances, features
        ]
</code></pre>
<p>The <code>scrape_dataset_details</code> function retrieves the HTML content of a dataset page and parses it using BeautifulSoup. It extracts information by targeting specific HTML elements based on their tags and classes, such as dataset names, donation dates, and descriptions. </p>
<p>The function uses methods like <code>find</code> and <code>find_all</code> to locate these elements and retrieve their text content, handling cases where elements might be missing by providing default values. </p>
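<p>This approach returns the details in a consistent, structured format. Keep in mind that the selectors depend on the page's current tags and class names (including generated ones such as <code>svelte-17wf9gp</code>), so you may need to update them if the site's markup changes.</p>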
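<p>The repeated "use the text if the element exists, otherwise fall back to a default" pattern could also be pulled into a small helper. The sketch below is one possible way to do that. The name <code>text_or_default</code> is hypothetical, and a <code>SimpleNamespace</code> object stands in for a Beautiful Soup tag so the example runs on its own.</p>
<pre><code class="lang-python">from types import SimpleNamespace


def text_or_default(element, default="N/A"):
    """Return the stripped text of a parsed element, or default if it is missing."""
    if element is None:
        return default
    return element.text.strip()


# Stand-in for a Beautiful Soup tag, which also exposes a .text attribute.
fake_tag = SimpleNamespace(text="  Iris  ")
print(text_or_default(fake_tag))
print(text_or_default(None))
</code></pre>
<p>With a helper like this, a lookup such as <code>text_or_default(soup.find('h1'))</code> handles the missing-element case in one place.</p>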
<h3 id="heading-step-4-create-a-function-to-scrape-dataset-listings">Step 4: Create a Function to Scrape Dataset Listings</h3>
<p>Still inside <code>scrape_uci_datasets</code>, define a function <code>scrape_datasets</code> that takes the URL of a page listing multiple datasets, retrieves the HTML content, and finds all dataset links. For each link, it calls <code>scrape_dataset_details</code> to get detailed information.</p>
<pre><code class="lang-python">    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scrape_datasets</span>(<span class="hljs-params">page_url</span>):</span>
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, <span class="hljs-string">'html.parser'</span>)


        dataset_list = soup.find_all(
            <span class="hljs-string">'a'</span>, class_=<span class="hljs-string">'link-hover link text-xl font-semibold'</span>)


        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> dataset_list:
            print(<span class="hljs-string">"No dataset links found"</span>)
            <span class="hljs-keyword">return</span>


        <span class="hljs-keyword">for</span> dataset <span class="hljs-keyword">in</span> dataset_list:
            dataset_link = <span class="hljs-string">"https://archive.ics.uci.edu"</span> + dataset[<span class="hljs-string">'href'</span>]
            print(<span class="hljs-string">f"Scraping details for <span class="hljs-subst">{dataset.text.strip()}</span>..."</span>)
            dataset_details = scrape_dataset_details(dataset_link)
            data.append(dataset_details)
</code></pre>
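<p>The code above builds each dataset link by joining the site's domain and the <code>href</code> value as strings, which works as long as every <code>href</code> is a path starting with <code>/</code>. As a hedged alternative, the standard library's <code>urljoin</code> also copes with links that are already absolute. The path <code>/dataset/53/iris</code> below is only an illustrative value, and <code>absolute_link</code> is a hypothetical helper name.</p>
<pre><code class="lang-python">from urllib.parse import urljoin

BASE_URL = "https://archive.ics.uci.edu/datasets"


def absolute_link(href, base_url=BASE_URL):
    """Resolve an href found on the listing page against the page URL."""
    return urljoin(base_url, href)


# A site-relative path and an already-absolute URL resolve to the same link.
print(absolute_link("/dataset/53/iris"))
print(absolute_link("https://archive.ics.uci.edu/dataset/53/iris"))
</code></pre>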
<h3 id="heading-step-5-loop-through-pages-using-pagination-parameters">Step 5: Loop Through Pages Using Pagination Parameters</h3>
<p>Implement a loop to navigate through the pages using pagination parameters. The loop continues until no new data is added, indicating that all pages have been scraped.</p>
<pre><code class="lang-python">    skip = <span class="hljs-number">0</span>
    take = <span class="hljs-number">10</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        page_url = <span class="hljs-string">f"https://archive.ics.uci.edu/datasets?skip=<span class="hljs-subst">{skip}</span>&amp;take=<span class="hljs-subst">{take}</span>&amp;sort=desc&amp;orderBy=NumHits&amp;search="</span>
        print(<span class="hljs-string">f"Scraping page: <span class="hljs-subst">{page_url}</span>"</span>)
        initial_data_count = len(data)
        scrape_datasets(page_url)
        <span class="hljs-keyword">if</span> len(data) == initial_data_count:  <span class="hljs-comment"># no new rows, so this was the last page</span>
            <span class="hljs-keyword">break</span>
        skip += take
</code></pre>
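<p>To see how the <code>skip</code> and <code>take</code> parameters advance from page to page, here is a small standalone sketch that only generates the page URLs. It assumes the same query parameters as the loop above. The <code>page_urls</code> helper and its <code>max_pages</code> cap are additions for this example, not part of the scraper itself.</p>
<pre><code class="lang-python">from urllib.parse import urlencode

BASE_URL = "https://archive.ics.uci.edu/datasets"


def page_urls(take=10, max_pages=3):
    """Yield listing-page URLs, advancing skip by take for each page."""
    for page in range(max_pages):
        query = urlencode({
            "skip": page * take,
            "take": take,
            "sort": "desc",
            "orderBy": "NumHits",
            "search": "",
        })
        yield f"{BASE_URL}?{query}"


for url in page_urls(max_pages=2):
    print(url)
</code></pre>
<p>A cap like <code>max_pages</code> is handy while testing, because it stops the scraper from walking the whole site before you know the selectors work. Adding a short <code>time.sleep()</code> between pages is another easy way to keep the load on the server low.</p>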
<h3 id="heading-step-6-save-the-scraped-data-to-a-csv-file">Step 6: Save the Scraped Data to a CSV File</h3>
<p>After scraping all the data, save it to a CSV file.</p>
<pre><code class="lang-python">    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'uci_datasets.csv'</span>, <span class="hljs-string">'w'</span>, newline=<span class="hljs-string">''</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:
        writer = csv.writer(file)
        writer.writerow(headers)
        writer.writerows(data)


    print(<span class="hljs-string">"Scraping complete. Data saved to 'uci_datasets.csv'."</span>)
</code></pre>
<h3 id="heading-step-7-run-the-scraping-function">Step 7: Run the Scraping Function</h3>
<p>Finally, call the <code>scrape_uci_datasets</code> function to start the scraping process.</p>
<pre><code class="lang-python">scrape_uci_datasets()
</code></pre>
<h2 id="heading-full-code"><strong>Full Code</strong></h2>
<p>Here is the complete code for the web scraper:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">import</span> csv


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scrape_uci_datasets</span>():</span>
    base_url = <span class="hljs-string">"https://archive.ics.uci.edu/datasets"</span>


    headers = [
        <span class="hljs-string">"Dataset Name"</span>, <span class="hljs-string">"Donated Date"</span>, <span class="hljs-string">"Description"</span>,
        <span class="hljs-string">"Dataset Characteristics"</span>, <span class="hljs-string">"Subject Area"</span>, <span class="hljs-string">"Associated Tasks"</span>,
        <span class="hljs-string">"Feature Type"</span>, <span class="hljs-string">"Instances"</span>, <span class="hljs-string">"Features"</span>
    ]


    <span class="hljs-comment"># List to store the scraped data</span>
    data = []


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scrape_dataset_details</span>(<span class="hljs-params">dataset_url</span>):</span>
        response = requests.get(dataset_url)
        soup = BeautifulSoup(response.text, <span class="hljs-string">'html.parser'</span>)


        dataset_name = soup.find(
            <span class="hljs-string">'h1'</span>, class_=<span class="hljs-string">'text-3xl font-semibold text-primary-content'</span>)
        dataset_name = dataset_name.text.strip() <span class="hljs-keyword">if</span> dataset_name <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>


        donated_date = soup.find(<span class="hljs-string">'h2'</span>, class_=<span class="hljs-string">'text-sm text-primary-content'</span>)
        donated_date = donated_date.text.strip().replace(
            <span class="hljs-string">'Donated on '</span>, <span class="hljs-string">''</span>) <span class="hljs-keyword">if</span> donated_date <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>


        description = soup.find(<span class="hljs-string">'p'</span>, class_=<span class="hljs-string">'svelte-17wf9gp'</span>)
        description = description.text.strip() <span class="hljs-keyword">if</span> description <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>


        details = soup.find_all(<span class="hljs-string">'div'</span>, class_=<span class="hljs-string">'col-span-4'</span>)


        dataset_characteristics = details[<span class="hljs-number">0</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        subject_area = details[<span class="hljs-number">1</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        associated_tasks = details[<span class="hljs-number">2</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">2</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        feature_type = details[<span class="hljs-number">3</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">3</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        instances = details[<span class="hljs-number">4</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">4</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>
        features = details[<span class="hljs-number">5</span>].find(<span class="hljs-string">'p'</span>).text.strip() <span class="hljs-keyword">if</span> len(
            details) &gt; <span class="hljs-number">5</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"N/A"</span>


        <span class="hljs-keyword">return</span> [
            dataset_name, donated_date, description, dataset_characteristics,
            subject_area, associated_tasks, feature_type, instances, features
        ]


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scrape_datasets</span>(<span class="hljs-params">page_url</span>):</span>
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, <span class="hljs-string">'html.parser'</span>)


        dataset_list = soup.find_all(
            <span class="hljs-string">'a'</span>, class_=<span class="hljs-string">'link-hover link text-xl font-semibold'</span>)


        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> dataset_list:
            print(<span class="hljs-string">"No dataset links found"</span>)
            <span class="hljs-keyword">return</span>


        <span class="hljs-keyword">for</span> dataset <span class="hljs-keyword">in</span> dataset_list:
            dataset_link = <span class="hljs-string">"https://archive.ics.uci.edu"</span> + dataset[<span class="hljs-string">'href'</span>]
            print(<span class="hljs-string">f"Scraping details for <span class="hljs-subst">{dataset.text.strip()}</span>..."</span>)
            dataset_details = scrape_dataset_details(dataset_link)
            data.append(dataset_details)


    <span class="hljs-comment"># Loop through the pages using the pagination parameters</span>
    skip = <span class="hljs-number">0</span>
    take = <span class="hljs-number">10</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        page_url = <span class="hljs-string">f"https://archive.ics.uci.edu/datasets?skip=<span class="hljs-subst">{skip}</span>&amp;take=<span class="hljs-subst">{take}</span>&amp;sort=desc&amp;orderBy=NumHits&amp;search="</span>
        print(<span class="hljs-string">f"Scraping page: <span class="hljs-subst">{page_url}</span>"</span>)
        initial_data_count = len(data)
        scrape_datasets(page_url)
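        <span class="hljs-comment"># Stop when a page adds no new rows (the last page has been reached)</span>
        <span class="hljs-keyword">if</span> len(data) == initial_data_count: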
            <span class="hljs-keyword">break</span>
        skip += take


    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'uci_datasets.csv'</span>, <span class="hljs-string">'w'</span>, newline=<span class="hljs-string">''</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:
        writer = csv.writer(file)
        writer.writerow(headers)
        writer.writerows(data)


    print(<span class="hljs-string">"Scraping complete. Data saved to 'uci_datasets.csv'."</span>)


scrape_uci_datasets()
</code></pre>
<p>Once you run the script, it will run for a while until the terminal says “No dataset links found”, followed by “Scraping complete. Data saved to 'uci_datasets.csv'”, indicating that the scraped data has been saved in a CSV file.</p>
<p><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXdRUvJJsu32oaxdattur__98CEF9GvqQMDTDQzpqS-NW3I2-haF5tfWH_mIBFwEhAqLhUhURVKCNFJE-b1bRzeZtz2oApWePqLZqWahKT0uhoXN0Ok7JJQnWN32dWQOHclZ2y9hg2MdqvoLDhToy-gCj9o?key=f_hrU3B_rjNJFpKZiiV3Pw" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To view the scraped data, open 'uci_datasets.csv'. You should see the data organized by Dataset Name, Donated Date, Description, Characteristics, Subject Area, and so on.</p>
<p><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXd1ZkPzSyPxZ3KsZklCPPcruSll4xUBxm3KiNdageDzHK-wbTxG7v8HLFpoJ-gMvIpdKPxzoshzRlmNjiPeVcbvse14gdGFHu7Wm89UgTACtImpToHOkqcU29S6s31CzC_T20h1bUO4w0D9sLFC_5Tmy3o?key=f_hrU3B_rjNJFpKZiiV3Pw" alt="Image" width="600" height="400" loading="lazy">
<em>Data organized by Dataset Name, Donated Date, Description, Characteristics, Subject Area, and so on.</em></p>
<p>You can have a better view of the data if you open the file via Excel.</p>
<p><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXfdmf621HGzQNHCdgxTJ6cvl2YEpuAq5hfvqpE9KrbZ8kDkGo6R3YIYpCFMmNoY8z29YEfcesZap9hpxiLc3fwHEyzLdo6dNQGNExRdam3t3taUebgKL_ocDFXyo2KhhMTpGDod2sUQI5miEUp_UCyNPZo?key=f_hrU3B_rjNJFpKZiiV3Pw" alt="Image" width="600" height="400" loading="lazy">
<em>Data organized in Excel file</em></p>
<p>By following the logic mentioned in this article, you can scrape many sites. All you need to do is start from the base URL, figure out how to navigate through the list, and go to the dedicated page for each list item. Then, identify suitable page elements like IDs and classes where you can isolate and extract the data you want. </p>
<p>You also need to understand the logic behind pagination. Most often, pagination makes slight changes to the URL, which you can use to loop from one page to another. </p>
<p>Finally, you can write the data to a CSV file, which is suitable for storing and as input for visualization.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Using Python along with Requests and Beautiful Soup allows you to create fully functional web scrapers to extract data from websites. While this functionality can be highly advantageous for data-driven decision-making, it is important to keep ethical and legal considerations in mind.</p>
<p>Once you become familiar with the methods used in this script, you can explore techniques like proxy management and data persistence. You can also familiarize yourself with other libraries like Scrapy, Selenium, and Puppeteer to fulfill your data collection needs. </p>
<p>Thank you for reading! I'm Jess, and I'm an expert at Hyperskill. You can check out my <strong>Python</strong> developer course on the platform.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Scrape Amazon Product Reviews Behind a Login ]]>
                </title>
                <description>
                    <![CDATA[ By Satyam Tripathi Amazon is the most popular e-commerce website for web scrapers, with billions of product pages being scraped every month.  It is also home to a vast database of product reviews, which can be very useful for market research and comp... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-scrape-amazon-product-reviews-behind-a-login/</link>
                <guid isPermaLink="false">66d461744bc8f441cb6df837</guid>
                
                    <category>
                        <![CDATA[ node js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ puppeteer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 30 Oct 2023 16:46:40 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/10/pexels-pixabay-159751--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Satyam Tripathi</p>
<p>Amazon is the most popular e-commerce website for web scrapers, with billions of product pages being scraped every month. </p>
<p>It is also home to a vast database of product reviews, which can be very useful for market research and competitor monitoring. </p>
<p>You can extract relevant data from the Amazon website and save it in a spreadsheet or JSON format. And you can even automate the process to update the data regularly.</p>
<p>Scraping Amazon product reviews is not always straightforward, especially when a login is required. In this guide, you'll learn how to scrape Amazon product reviews behind a login. You’ll learn the process of logging in, parsing review data, and exporting reviews to CSV.</p>
<p><strong>Important Disclaimer:</strong> This tutorial is for educational purposes only. Scraping data from behind logins on websites may violate their terms and conditions (T&amp;Cs).  It's crucial to always check the T&amp;Cs of any website before scraping data.</p>
<p>Without further ado, let's get started.</p>
<h2 id="heading-prerequisites-and-project-setup">Prerequisites and Project Setup</h2>
<p>We’ll use the Node.js Puppeteer library to scrape Amazon reviews. Make sure Node.js is installed on your system. If it is not, go to the official <a target="_blank" href="https://nodejs.org/en">Node.js website</a> and install it. </p>
<p>After Node.js is installed, install Puppeteer. <a target="_blank" href="https://github.com/puppeteer/puppeteer">Puppeteer</a> is a Node.js library that provides a high-level, user-friendly API for automating tasks and interacting with dynamic web pages. </p>
<p>Now, let's install and configure Puppeteer.</p>
<p>Open a terminal and create a new folder with any name. (In my case, it is <em>amazon_reviews</em>.)</p>
<pre><code class="lang-bash">mkdir amazon_reviews
</code></pre>
<p>Change your current directory to the folder created above.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> amazon_reviews
</code></pre>
<p>Cool, you're now in the correct directory. Execute the following command to initialize the <em>package.json</em> file:</p>
<pre><code class="lang-bash">npm init -y
</code></pre>
<p>Finally, install Puppeteer using the following command:</p>
<pre><code class="lang-bash">npm install puppeteer
</code></pre>
<p>This is what the process looks like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-070530.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now, open the folder in any code editor, and create a new JavaScript file (index.js). Make sure that the hierarchy looks like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-070823.png" alt="Image" width="600" height="400" loading="lazy">
<em>Hierarchy showing <code>node_modules</code>, <code>index.js</code>, <code>package-lock.json</code>, and <code>package.json</code></em></p>
<p>All set up successfully. We’re now ready to code the scraper.</p>
<p><strong>Note:</strong> Ensure that you have an account on Amazon so you can progress through the rest of this tutorial.</p>
<h2 id="heading-step-1-get-access-to-the-public-page">Step 1: Get Access to the Public Page</h2>
<p>You're going to scrape the reviews of the product shown below. You’ll extract the author's name, review title, and date.</p>
<p>Here's the product URL: <a target="_blank" href="https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/">https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/</a></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-072923.png" alt="Image" width="600" height="400" loading="lazy">
<em>The product we're using in the example - headphones</em></p>
<p>First, you’ll log in to Amazon, and then redirect to the product URL to scrape the reviews.</p>
<h2 id="heading-step-2-scrape-behind-the-login">Step 2: Scrape Behind the Login</h2>
<p>Amazon's multi-stage login process requires users to enter their username or email, click a Continue button, enter their password, and then submit it. The username and password fields are typically on separate pages.</p>
<p>To enter the email ID, use the selector <code>input[name=email]</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-082325.png" alt="Image" width="600" height="400" loading="lazy">
<em>HTML of the sign-in field</em></p>
<p>Now, click on the Continue button using the selector <code>input[id=continue]</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-083136.png" alt="Image" width="600" height="400" loading="lazy">
<em>HTML of the continue button</em></p>
<p>Now you should be on the next page. To enter the password, use the selector <code>input[name=password]</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-083415.png" alt="Image" width="600" height="400" loading="lazy">
<em>HTML of the password field</em></p>
<p>Finally, click on the Sign In button using the selector <code>input[id=signInSubmit]</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-083833.png" alt="Image" width="600" height="400" loading="lazy">
<em>HTML of the sign-in button</em></p>
<p>Here’s the code for the login process:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> selectors = {
  <span class="hljs-attr">emailid</span>: <span class="hljs-string">'input[name=email]'</span>,
  <span class="hljs-attr">password</span>: <span class="hljs-string">'input[name=password]'</span>,
  <span class="hljs-attr">continue</span>: <span class="hljs-string">'input[id=continue]'</span>,
  <span class="hljs-attr">signin</span>: <span class="hljs-string">'input[id=signInSubmit]'</span>,
};


    <span class="hljs-keyword">await</span> page.goto(signinURL);
    <span class="hljs-keyword">await</span> page.waitForSelector(selectors.emailid);
    <span class="hljs-keyword">await</span> page.type(selectors.emailid, <span class="hljs-string">"satyam@gmail.com"</span>, { <span class="hljs-attr">delay</span>: <span class="hljs-number">100</span> });
    <span class="hljs-keyword">await</span> page.click(selectors.continue);
    <span class="hljs-keyword">await</span> page.waitForSelector(selectors.password);
    <span class="hljs-keyword">await</span> page.type(selectors.password, <span class="hljs-string">"mypassword"</span>, { <span class="hljs-attr">delay</span>: <span class="hljs-number">100</span> });
    <span class="hljs-keyword">await</span> page.click(selectors.signin);
    <span class="hljs-keyword">await</span> page.waitForNavigation();
</code></pre>
<p>We're following the same steps as discussed above. First, go to the sign-in URL, enter the email ID, and click on the Continue button. Then enter the password, click on the Sign In button, and wait for a moment for the sign-in process to complete.</p>
<p>After the sign-in process is completed, you’ll be redirected to the product page to scrape the reviews.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-072923-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Product page</em></p>
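<p>The login steps above can also be wrapped in a reusable function. The following is a minimal sketch, not the author's code: the function name <code>login</code>, the <code>credentials</code> object, and the <code>loginSelectors</code> name are assumptions made for illustration, while the selectors and the Puppeteer calls are the ones used in this step. It starts waiting for the navigation before clicking the Sign In button, so a fast redirect is not missed.</p>

```javascript
// Selectors taken from this step; all other names here are assumptions.
const loginSelectors = {
  emailid: 'input[name=email]',
  password: 'input[name=password]',
  continue: 'input[id=continue]',
  signin: 'input[id=signInSubmit]',
};

// `page` is expected to be a Puppeteer Page. `credentials` is a hypothetical
// { email, password } object supplied by the caller.
async function login(page, signinURL, credentials) {
  await page.goto(signinURL);
  await page.waitForSelector(loginSelectors.emailid);
  await page.type(loginSelectors.emailid, credentials.email, { delay: 100 });
  await page.click(loginSelectors.continue);

  await page.waitForSelector(loginSelectors.password);
  await page.type(loginSelectors.password, credentials.password, { delay: 100 });

  // Begin waiting for the navigation before clicking, then wait for both.
  await Promise.all([
    page.waitForNavigation(),
    page.click(loginSelectors.signin),
  ]);
}
```

<p>Passing the credentials in as an argument keeps them out of the function body, so they can come from environment variables or a config file instead of being hardcoded.</p>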
<h2 id="heading-step-3-parse-the-review-data">Step 3: Parse the Review Data</h2>
<p>You've successfully logged in and are now on the product page that you want to scrape. Let's now parse the review data.</p>
<p>On the page, you'll find various reviews. These reviews are contained within a parent <code>div</code> with the ID <code>cm-cr-dp-review-list</code>, which holds all the reviews on the current page. If you want to access more reviews, you'll need to navigate to the next page using the pagination process.</p>
<p>This parent div has multiple child divs, and each child div holds one review. To extract the reviews, you can use the selector <code>#cm-cr-dp-review-list div.review</code>.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> selectors = {
  <span class="hljs-attr">allReviews</span>: <span class="hljs-string">'#cm-cr-dp-review-list div.review'</span>,
  <span class="hljs-attr">authorName</span>: <span class="hljs-string">'div[data-hook="genome-widget"] span.a-profile-name'</span>,
  <span class="hljs-attr">reviewTitle</span>: <span class="hljs-string">'[data-hook=review-title]&gt;span:not([class])'</span>,
  <span class="hljs-attr">reviewDate</span>: <span class="hljs-string">'span[data-hook=review-date]'</span>,
};
</code></pre>
<p>This selector shows that you first go to the element with the ID <code>cm-cr-dp-review-list</code>, then search for all <code>div</code> elements with the class <code>review</code>. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/annotely_image.png" alt="Image" width="600" height="400" loading="lazy">
<em>Review data with Author name, Review Title, Description, etc.</em></p>
<p>The following code snippet shows that you should first go to the product URL, wait for the selector to load, and then scrape all the reviews and store them in the <code>reviewElements</code> variable.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">await</span> page.goto(productURL);
<span class="hljs-keyword">await</span> page.waitForSelector(selectors.allReviews);
<span class="hljs-keyword">const</span> reviewElements = <span class="hljs-keyword">await</span> page.$$(selectors.allReviews);
</code></pre>
<p>Now, let's extract the author's name, review title, and date.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-091701.png" alt="Image" width="600" height="400" loading="lazy">
<em>Targeting Author name, Review Title, and Date</em></p>
<p>To parse the author name, you can use the selector <code>div[data-hook="genome-widget"] span.a-profile-name</code>. This selector tells us to first search for the <code>div</code> element with the <code>data-hook</code> attribute set to <code>genome-widget</code>, because the names are inside this <code>div</code> element. Then, search for the <code>span</code> element with the class name <code>a-profile-name</code>. This is the element that contains the author's name.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> author = <span class="hljs-keyword">await</span> reviewElement.$eval(selectors.authorName, <span class="hljs-function">(<span class="hljs-params">element</span>) =&gt;</span> element.textContent);
</code></pre>
<p>To parse the review title, you can use the CSS selector <code>[data-hook="review-title"] &gt; span:not([class])</code>. This selector tells us to search for the <code>span</code> element that is a direct child of the <code>[data-hook="review-title"]</code> element and that does not have a class attribute.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> title = <span class="hljs-keyword">await</span> reviewElement.$eval(selectors.reviewTitle, <span class="hljs-function">(<span class="hljs-params">element</span>) =&gt;</span> element.textContent);
</code></pre>
<p>To parse the date, you can use the CSS selector <code>span[data-hook="review-date"]</code>. This selector tells us to search for the span element that has the <code>data-hook</code> attribute set to <code>review-date</code>. This is the element that contains the review date.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> rawReviewDate = <span class="hljs-keyword">await</span> reviewElement.$eval(selectors.reviewDate, <span class="hljs-function">(<span class="hljs-params">element</span>) =&gt;</span> element.textContent);
</code></pre>
<p>Note that you’ll get the entire text, including the location, instead of just the full date. Therefore, you must use a regular expression pattern to extract the date from the text. </p>
<p>After that, combine all of the data into the <code>reviewData</code> and then push it to the final list <code>reviewsData</code>.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> datePattern = <span class="hljs-regexp">/(\w+\s\d{1,2},\s\d{4})/</span>;
<span class="hljs-keyword">const</span> match = rawReviewDate.match(datePattern);
<span class="hljs-keyword">const</span> reviewDate = match ? match[<span class="hljs-number">0</span>].replace(<span class="hljs-string">','</span>, <span class="hljs-string">''</span>) : <span class="hljs-string">"Date not found"</span>;

<span class="hljs-keyword">const</span> reviewData = {
  author,
  title,
  reviewDate,
};

reviewsData.push(reviewData);
</code></pre>
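<p>To see what the regular expression does in isolation, here is a small standalone sketch. The function name <code>extractReviewDate</code> and the sample string are assumptions used only for illustration. The pattern itself is the one from the snippet above.</p>

```javascript
// Same pattern as above: month name, day, comma, four-digit year.
const datePattern = /(\w+\s\d{1,2},\s\d{4})/;

// Hypothetical helper that returns the date without the comma,
// or a fallback string when no date is found.
function extractReviewDate(rawText) {
  const match = rawText.match(datePattern);
  return match ? match[0].replace(',', '') : "Date not found";
}

// Hypothetical sample text, used only for illustration.
const sample = "Reviewed in the United States on October 12, 2023";
console.log(extractReviewDate(sample)); // October 12 2023
console.log(extractReviewDate("no date here")); // Date not found
```

<p>Note that the pattern expects the month first. A date written day-first, such as "12 October 2023", would not match and would fall through to "Date not found".</p>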
<p>The above process will run until it has parsed all of the reviews on the current page. Here’s the code snippet to parse the data:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> reviewElement <span class="hljs-keyword">of</span> reviewElements) {
      <span class="hljs-keyword">const</span> author = <span class="hljs-keyword">await</span> reviewElement.$eval(selectors.authorName, <span class="hljs-function">(<span class="hljs-params">element</span>) =&gt;</span> element.textContent);
      <span class="hljs-keyword">const</span> title = <span class="hljs-keyword">await</span> reviewElement.$eval(selectors.reviewTitle, <span class="hljs-function">(<span class="hljs-params">element</span>) =&gt;</span> element.textContent);
      <span class="hljs-keyword">const</span> rawReviewDate = <span class="hljs-keyword">await</span> reviewElement.$eval(selectors.reviewDate, <span class="hljs-function">(<span class="hljs-params">element</span>) =&gt;</span> element.textContent);

      <span class="hljs-keyword">const</span> datePattern = <span class="hljs-regexp">/(\w+\s\d{1,2},\s\d{4})/</span>;
      <span class="hljs-keyword">const</span> match = rawReviewDate.match(datePattern);
      <span class="hljs-keyword">const</span> reviewDate = match ? match[<span class="hljs-number">0</span>].replace(<span class="hljs-string">','</span>, <span class="hljs-string">''</span>) : <span class="hljs-string">"Date not found"</span>;

      <span class="hljs-keyword">const</span> reviewData = {
        author,
        title,
        reviewDate,
      };

      reviewsData.push(reviewData);
    }
</code></pre>
<p>Great! You’ve successfully parsed the relevant data, which is now in JSON format, as shown below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-095917.png" alt="Image" width="600" height="400" loading="lazy">
<em>Scraped the data in JSON format</em></p>
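<p>The loop above assumes that every review contains all three elements. In Puppeteer, <code>$eval</code> throws an error when no element matches the selector, so a single unusual review can stop the whole run. The sketch below shows one possible way to guard against that. The helper name <code>textOrFallback</code> and the <code>"N/A"</code> default are assumptions, not part of the original script.</p>

```javascript
// Hypothetical helper: returns the trimmed text for `selector` inside
// `handle`, or `fallback` when $eval rejects (for example, no match).
async function textOrFallback(handle, selector, fallback = "N/A") {
  try {
    const text = await handle.$eval(selector, (element) => element.textContent);
    return text ? text.trim() : fallback;
  } catch (error) {
    return fallback;
  }
}
```

<p>Inside the loop, a call such as <code>await textOrFallback(reviewElement, selectors.authorName)</code> could then stand in for the direct <code>$eval</code> call. Trimming also removes the stray whitespace that often surrounds text in the page markup.</p>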
<h2 id="heading-step-4-export-reviews-to-a-csv">Step 4: Export Reviews to a CSV</h2>
<p>You've parsed the reviews into JSON, which is readable but not convenient to open in a spreadsheet. Converting the data to CSV makes it easier to read and to use in other tools. </p>
<p>There are many ways to convert JSON data to CSV, but we'll use a simple and effective approach. Here is a simple code snippet to convert JSON to CSV:</p>
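<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> fs = <span class="hljs-built_in">require</span>(<span class="hljs-string">"fs"</span>);

<span class="hljs-comment">// Start with the header row, then add one row per review</span>
<span class="hljs-keyword">let</span> csvContent = <span class="hljs-string">"Author,Title,Date\n"</span>;
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> review <span class="hljs-keyword">of</span> reviewsData) {
  <span class="hljs-keyword">const</span> { author, title, reviewDate } = review;
  csvContent += <span class="hljs-string">`${author},"${title}",${reviewDate}\n`</span>;
}

<span class="hljs-keyword">const</span> csvFileName = <span class="hljs-string">"amazon_reviews.csv"</span>;
<span class="hljs-comment">// writeFileSync is synchronous, so it does not need await</span>
fs.writeFileSync(csvFileName, csvContent, <span class="hljs-string">"utf8"</span>);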
</code></pre>
<p>Here’s the output of the CSV file.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/10/Screenshot-2023-10-27-102705.png" alt="Image" width="600" height="400" loading="lazy">
<em>Converted JSON data into CSV format</em></p>
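<p>The simple approach works for well-behaved data, but a title that contains a double quote or an author name that contains a comma would shift the columns. The sketch below shows one common way to handle this, following the quoting convention described in RFC 4180: wrap every field in double quotes and double any quotes inside it. The function names and the sample review are assumptions for illustration.</p>

```javascript
// Quote a single CSV field: wrap it in double quotes and double any
// embedded quotes (the convention described in RFC 4180).
function escapeCsvField(value) {
  const text = String(value);
  return `"${text.replace(/"/g, '""')}"`;
}

// Build CSV text from an array of { author, title, reviewDate } objects.
function reviewsToCsv(reviews) {
  const rows = reviews.map(({ author, title, reviewDate }) =>
    [author, title, reviewDate].map(escapeCsvField).join(",")
  );
  return ["Author,Title,Date", ...rows].join("\n") + "\n";
}

// Hypothetical sample data, used only for illustration.
const sampleReviews = [
  { author: "Doe, Jane", title: 'Great "budget" stand', reviewDate: "October 12 2023" },
];
console.log(reviewsToCsv(sampleReviews));
```

<p>The resulting string can be written to disk with <code>fs.writeFileSync</code> in the same way as before.</p>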
<p>And there you have it!</p>
<p>You can find the full code on GitHub <a target="_blank" href="https://gist.github.com/triposat/20706d61989a4031669c2e3d25f487d0">here</a>.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this guide, you learned how to scrape Amazon product reviews behind a login using Puppeteer. You learned how to log in, parse relevant data, and save it to a CSV file. </p>
<p>For more practice, try extracting the reviews from every page using pagination.</p>
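<p>As a starting point for that exercise, here is a minimal sketch of a generic pagination loop. It is not tied to Amazon's markup: <code>scrapePage</code> and <code>goToNextPage</code> are hypothetical callbacks you would write yourself with Puppeteer, and <code>maxPages</code> is an assumed safety limit so the loop cannot run forever.</p>

```javascript
// Generic pagination loop. `scrapePage` is a hypothetical callback that
// resolves to the reviews on the current page. `goToNextPage` is a
// hypothetical callback that moves to the next page and resolves to
// false when there is no next page.
async function scrapeAllPages(scrapePage, goToNextPage, maxPages = 50) {
  const allReviews = [];
  let pagesRemaining = maxPages;
  while (pagesRemaining > 0) {
    const reviews = await scrapePage();
    allReviews.push(...reviews);
    pagesRemaining -= 1;
    const hasNext = await goToNextPage();
    if (!hasNext) break;
  }
  return allReviews;
}
```

<p>In a real scraper, <code>scrapePage</code> would contain the parsing loop from Step 3, and <code>goToNextPage</code> would click the page's "next" control and wait for the new reviews to load.</p>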
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Web Scraping with Google Sheets – How to Scrape Web Pages with Built-in Functions ]]>
                </title>
                <description>
                    <![CDATA[ You read that right – you can practice web scraping without leaving your happy place: Google Sheets. Google Sheets has five built-in functions that help you import data from other sheets and other web pages. We'll walk through all of them in order fr... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/web-scraping-google-sheets/</link>
                <guid isPermaLink="false">66b8de2af805ffd579552e9e</guid>
                
                    <category>
                        <![CDATA[ google sheets ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Eamonn Cottrell ]]>
                </dc:creator>
                <pubDate>Thu, 07 Sep 2023 21:14:07 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/09/5-functions-for-web-scraping-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You read that right – you can practice web scraping without leaving your happy place: Google Sheets.</p>
<p>Google Sheets has five built-in functions that help you import data from other sheets and other web pages. We'll walk through all of them in order from easiest (most limited) to hardest (most powerful).</p>
<p>Here they are, and you can click each function to skip down to its dedicated section. I've made a video as well that walks through everything:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/Hx1Uepq3lLI" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<h3 id="heading-section-shortcuts">Section Shortcuts</h3>
<ul>
<li>How to use the <a class="post-section-overview" href="#">IMPORTRANGE</a> function</li>
<li>How to use the  <a class="post-section-overview" href="#-1">IMPORTDATA</a> function</li>
<li>How to use the <a class="post-section-overview" href="#-2">IMPORTFEED</a> function</li>
<li>How to use the <a class="post-section-overview" href="#-3">IMPORTHTML</a> function</li>
<li>How to use the <a class="post-section-overview" href="#-4">IMPORTXML</a> function</li>
</ul>
<p><a target="_blank" href="https://docs.google.com/spreadsheets/d/1n8CYEHYktePXJzt5quCBn2gwHvnvTH49vvJziXLnQSE/edit#gid=511198009">Here's the Google Sheet</a> we'll be using to demo each function.</p>
<p>If you'd like to edit it, make a copy by selecting File - Make a copy when you open it.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>screenshot of Google Sheet</em></p>
<p><a id="importrange"></a></p>
<h2 id="heading-how-to-use-the-importrange-function">How to use the IMPORTRANGE function</h2>
<p>This is the only function that imports a range from another sheet rather than data from another web page. So, if you've got another Google Sheet, you can link the two sheets together and import the data you need from one sheet into the other sheet.</p>
<p>For instance, <a target="_blank" href="https://docs.google.com/spreadsheets/d/1S0H1FDHBC_7oxe2NCpnfuJcklaLpYCFuo_eRhADnyWg/edit#gid=1363138812">here's a sheet</a> with a bunch of random Samsung Galaxy data in it.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can see that we have a few hundred rows of data about phones. If we want to import this data into another spreadsheet, we can use <code>IMPORTRANGE()</code>. This is the simplest to use of the five functions we'll look at. All it needs is a URL for a Google Sheet and the range we want to import.</p>
<p>Check out the tab for IMPORTRANGE in the Google Sheet <a target="_blank" href="https://docs.google.com/spreadsheets/d/1n8CYEHYktePXJzt5quCBn2gwHvnvTH49vvJziXLnQSE/edit#gid=0">here</a>, and you'll see that in cell <code>A5</code>, we've got the function <code>=IMPORTRANGE(B4,"data!a1:K")</code>. This is pulling in the range <code>A1:K</code> from the <code>data</code> tab of our second spreadsheet whose URL is in cell <code>B4</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>screenshot of IMPORTRANGE function</em></p>
<p>Once your data is pulled into your spreadsheet, you can do one of two things. </p>
<ol>
<li>Leave it linked through the <code>IMPORTRANGE</code> function. This way, if your data source is going to be updated, you'll pull in the updated data.</li>
<li>Copy the data and press CTRL+SHIFT+V to paste values only. This way, you have the raw data in your new spreadsheet and you won't depend on the source URL staying the same down the road.</li>
</ol>
<p><a id="importdata"></a></p>
<h2 id="heading-how-to-use-the-importdata-function">How to use the IMPORTDATA function</h2>
<p>This is pretty straightforward. It'll import .csv or .tsv data from anywhere on the internet. These stand for Comma Separated Values and Tab Separated Values. </p>
<p>.csv is the most commonly used file type for financial data that needs to be imported into spreadsheets and other programs. </p>
<p>Like <code>IMPORTRANGE</code>, we only need a couple pieces of information for <code>IMPORTDATA</code>: the URL where the file lives, and the delimiter. There's also an optional variable for locale, but I found that it was unnecessary.</p>
<p>In fact, Google Sheets is pretty smart – you can leave off the delimiter too, and it will usually decipher what type of data (.csv or .tsv) lives at the URL.</p>
<p>You can see that I've found a New York government data website that hosts winning lottery number data. I've put the URL for a .csv file in <code>A5</code>, and then used the function <code>=IMPORTDATA(A5,",")</code> to pull in the data from the .csv file.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Screenshot of IMPORTDATA function</em></p>
<p>You could alternatively download the .csv file and then select File - Import to bring in this data. But in the event that you do not have download permissions or simply want to get it straight from a site, <code>IMPORTDATA</code> works great.</p>
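<p>If it helps to see what the delimiter argument is doing, here is a rough Python sketch of the same idea. It is not how Google Sheets implements <code>IMPORTDATA</code>; the sample text and the <code>parse_delimited</code> helper are made up for illustration, and a real file would be downloaded from the URL first.</p>

```python
import csv
import io

# Hypothetical sample data standing in for the files that live at a URL.
CSV_TEXT = "Draw Date,Winning Numbers\n09/01/2023,01 02 03\n"
TSV_TEXT = "Draw Date\tWinning Numbers\n09/01/2023\t01 02 03\n"


def parse_delimited(text, delimiter=","):
    """Split delimited text into a list of rows (each row is a list of cells)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The same data comes out either way; only the delimiter differs.
csv_rows = parse_delimited(CSV_TEXT, ",")
tsv_rows = parse_delimited(TSV_TEXT, "\t")
print(csv_rows)
print(tsv_rows)
```

<p>The only thing that changes between the .csv and .tsv versions is the delimiter, which is why the function needs so little information.</p>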
<p><a id="importfeed"></a></p>
<h2 id="heading-how-to-use-the-importfeed-function">How to use the IMPORTFEED function</h2>
<p>This imports RSS feed data. If you're familiar with podcasting, you may recognize the term. Every podcast has an RSS feed which is a structured file full of XML data. </p>
<p>Using the URL for the RSS feed, IMPORTFEED will pull in data about a podcast, news article, or blog from its RSS information.</p>
<p>This is the first function that begins to have a few more options at its disposal, too.</p>
<p>All that's required is the URL of a feed, and it'll bring in data from that feed. However, we can specify a few other parameters if we like. The options include:</p>
<ol>
<li>[query]: this specifies which pieces of data to pull from the feed. We can select from options like "feed &lt;type&gt;" where type can be title, description, author or URL. Same deal with "items &lt;type&gt;" where type can be title, summary, URL or created.</li>
<li>[headers]: this will either bring in headers (TRUE) or not (FALSE)</li>
<li>[num_items]: this will specify how many items to return when using Query. (The docs state that if this isn't specified, all items currently published are returned, but I did not find this to be the case. I had to specify a larger number to get back more than a dozen or so).</li>
</ol>
<p>You can see from the screenshots below that I am querying one of my feeds to pull in the episode titles and URLs. </p>
<p>First, to get all the titles, I used <code>IMPORTFEED(A3, "items title", TRUE, 50)</code>:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-6.png" alt="Image" width="600" height="400" loading="lazy">
<em>Screenshot of IMPORTFEED</em></p>
<p>Then, similarly for the URLs, I used <code>IMPORTFEED(A3, "items url", TRUE, 50)</code>:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-7.png" alt="Image" width="600" height="400" loading="lazy">
<em>Screenshot of IMPORTFEED #2</em></p>
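<p>For a feel of what "items title" and "items url" are pulling out, here is a minimal Python sketch using the standard library's XML parser. The feed below is a tiny made-up example, and <code>feed_items</code> is a hypothetical helper, not part of any Google API. It assumes a plain RSS 2.0 layout where each <code>&lt;item&gt;</code> sits inside <code>&lt;channel&gt;</code>.</p>

```python
import xml.etree.ElementTree as ET

# A tiny made-up feed; a real one would be downloaded from the feed URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item><title>Episode 1</title><link>https://example.com/ep1</link></item>
    <item><title>Episode 2</title><link>https://example.com/ep2</link></item>
    <item><title>Episode 3</title><link>https://example.com/ep3</link></item>
  </channel>
</rss>"""


def feed_items(feed_xml, field, num_items=None):
    """Return the text of `field` for each <item>, optionally capped at num_items."""
    channel = ET.fromstring(feed_xml).find("channel")
    values = [item.findtext(field, default="") for item in channel.findall("item")]
    return values if num_items is None else values[:num_items]


titles = feed_items(SAMPLE_FEED, "title", num_items=2)  # like "items title" with num_items
links = feed_items(SAMPLE_FEED, "link")                 # like "items url"
print(titles)
print(links)
```
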
<p><a id="importhtml"></a></p>
<h2 id="heading-how-to-use-the-importhtml-function">How to use the IMPORTHTML function</h2>
<p>Now we're getting into scraping data straight off of a web site. This will take a URL and then a query parameter where we specify to look for either a "table" or a "list".</p>
<p>This is followed by an index value representing which table or list to look for if there are multiple on the page. Google's documentation describes this index as starting at 1, although the sample sheet for this article passes 0 to get the first table.</p>
<p>IMPORTHTML looks through the HTML code on a website for <code>&lt;table&gt;</code> elements and list elements (<code>&lt;ol&gt;</code> and <code>&lt;ul&gt;</code>).</p>
<pre><code class="lang-html"><span class="hljs-comment">&lt;!--Here's what a simple table looks like:--&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">table</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">thead</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">tr</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">th</span>&gt;</span>table header 1<span class="hljs-tag">&lt;/<span class="hljs-name">th</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">th</span>&gt;</span>table header 2<span class="hljs-tag">&lt;/<span class="hljs-name">th</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">tr</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">thead</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">tbody</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">tr</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>table data row 1 cell1<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>table data row 1 cell2<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">tr</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">tr</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>table data row 2 cell1<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>table data row 2 cell2<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">tr</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">tbody</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">table</span>&gt;</span>

<span class="hljs-comment">&lt;!--Here's what an ordered list looks like:--&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">ol</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">li</span>&gt;</span>ordered item 1<span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">li</span>&gt;</span>ordered item 2<span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">li</span>&gt;</span>ordered item 2<span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">ol</span>&gt;</span>
<span class="hljs-comment">&lt;!--Here's what an unordered list looks like:--&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">ul</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">li</span>&gt;</span>unordered item 1<span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">li</span>&gt;</span>unordered item 2<span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">li</span>&gt;</span>unordered item 3<span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">ul</span>&gt;</span>
</code></pre>
<p>In the sample sheet, I've got the URL for some stats about the Barkley Marathons in cell <code>B3</code> and am then referencing that in <code>A4</code>'s function: <code>=IMPORTHTML(B3,"table",0)</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-8.png" alt="Image" width="600" height="400" loading="lazy">
<em>Screenshot of IMPORTHTML</em></p>
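<p>To make the "find the Nth table" idea concrete, here is a small Python sketch built on the standard library's <code>html.parser</code>. It is only an illustration under simple assumptions (no nested tables, made-up sample HTML), and <code>TableParser</code> and <code>import_table</code> are hypothetical names. Note that the index here is a plain Python list index, so 0 is the first table.</p>

```python
from html.parser import HTMLParser

# Made-up page with two tables.
SAMPLE_HTML = """
<table>
  <tr><th>header 1</th><th>header 2</th></tr>
  <tr><td>cell 1</td><td>cell 2</td></tr>
</table>
<table>
  <tr><th>second table</th></tr>
  <tr><td>only cell</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one entry per table, each a list of rows
        self._in_cell = False  # True while we are inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data.strip()


def import_table(html, index):
    """Return the table at position `index` (a Python list index, so 0 is first)."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[index]


first_table = import_table(SAMPLE_HTML, 0)
second_table = import_table(SAMPLE_HTML, 1)
print(first_table)
print(second_table)
```
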
<p>FYI, freeCodeCamp created <a target="_blank" href="https://scrapepark.org/">ScrapePark</a> as a place to practice web scraping, so you can use it for <code>IMPORTHTML</code> and <code>IMPORTXML</code> coming up next👇.</p>
<p><a id="importxml"></a></p>
<h2 id="heading-how-to-use-the-importxml-function">How to use the IMPORTXML function</h2>
<p>We saved the best for last. This will look through websites and scrape darn near anything we want it to. It's complicated, though, because instead of importing all the table or list data like with <code>IMPORTHTML</code>, we write our queries using what's called XPath. </p>
<p>XPath is an expression language used to query XML documents. We can write XPath expressions to have <code>IMPORTXML</code> scrape all kinds of things from an HTML page.</p>
<p>There are many resources to find the proper XPath expressions. <a target="_blank" href="https://devhints.io/xpath">Here's one</a> that I used for this project.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-182.png" alt="Image" width="600" height="400" loading="lazy">
<em>screenshot of XPath cheat sheet</em></p>
<p>In the <a target="_blank" href="https://docs.google.com/spreadsheets/d/1n8CYEHYktePXJzt5quCBn2gwHvnvTH49vvJziXLnQSE/edit#gid=438611895">sheet</a> for <code>IMPORTXML</code>, I have several examples that I encourage you to click through and check out.</p>
<p>For example, using the function <code>=IMPORTXML(A11,"//*[@class='post-card-title']")</code> allows us to bring in all the titles of my articles because from inspecting the HTML on my author page here, I found that they all have the class <code>post-card-title</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/class.png" alt="Image" width="600" height="400" loading="lazy">
<em>screenshot of inspecting a web page with dev tools</em></p>
<p>In the same way, we can use the function <code>=IMPORTXML(A11,"//*[@class='post-card-title']//a/@href")</code> to grab the URL slug of each of those articles.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-10.png" alt="Image" width="600" height="400" loading="lazy">
<em>screenshot of IMPORTXML</em></p>
<p>You'll notice that it doesn't bring in the full URL, only the slug, so as a bonus, we can simply prepend the domain to each of these. Here's the function for the first row, which we can drag down to get all the proper URLs: <code>="https://www.freecodecamp.org"&amp;B13</code></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/image-11.png" alt="Image" width="600" height="400" loading="lazy">
<em>screenshot of prepending domain to slug</em></p>
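<p>The same two steps (select by class, then prepend the domain) can be sketched in Python. This is a simplified illustration: the page snippet is made up, and the standard library's <code>ElementTree</code> only supports a subset of XPath and needs well-formed markup, so instead of <code>/@href</code> the sketch reads the attribute with <code>.get("href")</code>.</p>

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet that mimics the author page structure.
SAMPLE_PAGE = """<html><body>
  <h2 class="post-card-title"><a href="/news/first-post/">First post</a></h2>
  <h2 class="post-card-title"><a href="/news/second-post/">Second post</a></h2>
  <p class="other">Not a title</p>
</body></html>"""

DOMAIN = "https://www.freecodecamp.org"

root = ET.fromstring(SAMPLE_PAGE)

# Equivalent in spirit to //*[@class='post-card-title']
cards = root.findall(".//*[@class='post-card-title']")
titles = ["".join(card.itertext()).strip() for card in cards]

# Equivalent in spirit to //*[@class='post-card-title']//a/@href
slugs = [a.get("href") for card in cards for a in card.findall(".//a")]

# Same as the ="https://www.freecodecamp.org"&B13 formula, applied to every row.
urls = [DOMAIN + slug for slug in slugs]
print(titles)
print(urls)
```
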
<h2 id="heading-follow-me">Follow Me</h2>
<p>I hope this was helpful for you! I learned a lot myself, and enjoyed putting the video together. </p>
<p>You can find me on YouTube: <a target="_blank" href="https://www.youtube.com/@eamonncottrell">https://www.youtube.com/@eamonncottrell</a></p>
<p>And, I've got a newsletter here: <a target="_blank" href="https://got-sheet.beehiiv.com/">https://got-sheet.beehiiv.com/</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Introducing ScrapePark.org – Practice Web Scraping Without Hurting Anyone ]]>
                </title>
                <description>
                    <![CDATA[ When I grew up in the 1990s, we skateboarded everywhere. If a parking lot had a ledge or a handrail, you'd better believe we waxed it and grinded our boards across it. Fast forward to the 2020s – many cities have built public skate parks. These are s... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/practice-web-scraping-safely/</link>
                <guid isPermaLink="false">66b8d5331a59d9c56a518bf7</guid>
                
                    <category>
                        <![CDATA[ beautiful soup ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Quincy Larson ]]>
                </dc:creator>
                <pubDate>Mon, 21 Aug 2023 19:50:15 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/08/shawn-henry-eK_aInAXydw-unsplash--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I grew up in the 1990s, we skateboarded everywhere. If a parking lot had a ledge or a handrail, you'd better believe we waxed it and grinded our boards across it.</p>
<p>Fast forward to the 2020s – many cities have built public skate parks. These are safe, legal places where we can skateboard without getting hassled by business owners or the cops.</p>
<p>As a dad in his 40s, I'm not scraping concrete with a skateboard much these days. Instead, I'm scraping websites to gather data I can analyze or feed into a machine learning algorithm.</p>
<p>But scraping the web can sometimes overload the websites you're scraping. Even though it may not violate a website's Terms of Service, you should be very gentle when doing it.</p>
<p>If you scrape a website that's not prepared, or if your scraping code is inefficient, you might bring the entire website down.</p>
<p>There are few things more embarrassing than accidentally DDoS'ing a friendly website that's just trying to serve visitors on the open web.</p>
<p>That's why – in the spirit of public skate parks – <a target="_blank" href="https://www.scrapepark.org">freeCodeCamp created ScrapePark.org</a>. Anyone can go there and practice web scraping techniques without worrying about hurting anyone.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/ScrapePark_org_--.png" alt="Image" width="600" height="400" loading="lazy">
<em>ScrapePark.org's landing page where you can practice web scraping techniques</em></p>
<p>ScrapePark.org is a simple E-commerce-style page that we've built specifically to stand up to heavy traffic loads.</p>
<p>You can practice scraping:</p>
<ul>
<li>tables</li>
<li>iframes</li>
<li>dropdown menus</li>
<li>links</li>
<li>lists</li>
<li>images</li>
<li>buttons</li>
<li>forms</li>
<li>image carousels</li>
<li>menu items</li>
<li>navigation bar items</li>
<li>and more</li>
</ul>
<p><a target="_blank" href="https://github.com/freeCodeCamp/scrapepark.org">The entire project is open source</a>. If you want to add some additional elements or pages to it, please be our guest.</p>
<p>We'd love for this to become the main place that people practice their scraping, so that we can spare those "mom &amp; pop" websites from getting overloaded.</p>
<p>Hone your web scraping skills on ScrapePark.org so you can then use them more responsibly on the open web.</p>
<p>This is just one of many initiatives the freeCodeCamp community has been working on this year. We have a lot more cool projects in the works.</p>
<p>A huge thanks to freeCodeCamp teachers <a target="_blank" href="https://github.com/estefaniacn">Estefania Cassingena Navone</a> and <a target="_blank" href="https://github.com/GEJ1">Gustavo Juantorena</a> for helping develop ScrapePark.org. They just published a Spanish-language course focused on Web Scraping. If you speak Spanish, check out the course. [2 hour watch]:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/yKi9-BfbfzQ" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p>If you want an English-language scraping course, freeCodeCamp has you covered. Try searching "scraping course" on Google or YouTube and look for the good ones from freeCodeCamp. 😃</p>
<p>Happy scraping.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Scrape Multiple Web Pages Using Python ]]>
                </title>
                <description>
                    <![CDATA[ By Shittu Olumide Data is all around us. Every website you visit includes data in a readable format that you can utilize for a project.  And although you can easily copy and paste the data, the best approach for big amounts of data is to perform web ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-scrape-multiple-web-pages-using-python/</link>
                <guid isPermaLink="false">66d4610a37bd2215d1e245d7</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 14 Feb 2023 19:48:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/02/How-to-Scrape-Multiple-Web-Pages-Using-Python-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shittu Olumide</p>
<p>Data is all around us. Every website you visit includes data in a readable format that you can utilize for a project. </p>
<p>And although you can easily copy and paste the data, the best approach for big amounts of data is to perform web scraping.</p>
<p>Learning web scraping can be tricky at first, but with a good web scraping library, things will become much easier. </p>
<p>Web scraping can be a useful tool for gathering data and information, but it is important to ensure that you do it in a safe and legal manner. </p>
<p>Here are some tips for performing web scraping properly:</p>
<ul>
<li>Seek permission before you scrape a site.</li>
<li>Read and understand the website's terms of service and robots.txt file.</li>
<li>Limit the frequency of your scraping.</li>
<li>Use web scraping tools that respect website owners' terms of service.</li>
</ul>
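<p>As a rough illustration of the robots.txt tip, Python's standard library includes <code>urllib.robotparser</code>. The sketch below parses a made-up robots.txt from a string so it runs without touching any website; the <code>my-scraper</code> user agent and the <code>allowed</code> helper are assumptions for the example. In a real script you would point the parser at the site's own robots.txt instead.</p>

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed from a string for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())


def allowed(url, user_agent="my-scraper"):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


can_fetch_movies = allowed("https://example.com/movies")
can_fetch_private = allowed("https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")  # seconds to wait between requests, if stated
print(can_fetch_movies, can_fetch_private, delay)
```

<p>A check like this covers only robots.txt. It doesn't replace reading the terms of service or asking for permission.</p>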
<p>Now that you understand the proper way to approach scraping, let's dive in. In this step-by-step tutorial, we will walk through how to scrape several pages of a website using Python's most user-friendly web scraping module, Beautiful Soup.</p>
<p>This tutorial will be divided into two portions: we will scrape a single page in the first phase. Then in the second section, we'll scrape several pages based on the code used in the first section.</p>
<h2 id="heading-requirements">Requirements</h2>
<p><strong>Python 3</strong>: you'll need to use Python 3 for this tutorial, because the library that we'll use is a Python library. To download and install Python check out the official <a target="_blank" href="https://www.python.org/downloads/">website</a>.</p>
<p><strong>Beautiful Soup</strong>: Beautiful Soup is a Python package for structured data parsing. For parsed pages, it generates a parse tree that you can use to extract data from HTML. It lets you interact with HTML in the same way you can interact with a web page using developer tools. </p>
<p>To begin using it, launch your terminal and install Beautiful Soup:</p>
<pre><code class="lang-bash">pip install beautifulsoup4
</code></pre>
<p><strong>Requests library</strong>: The <a target="_blank" href="https://pypi.org/project/requests/">requests library</a> is the de facto standard Python package for making HTTP requests (it is a third-party package, not part of the standard library). We'll use this in conjunction with Beautiful Soup to obtain the HTML for a website.</p>
<pre><code class="lang-bash">pip install requests
</code></pre>
<p><strong>Install a parser</strong>: To extract data from HTML text, we need a parser. We'll utilize the <code>lxml</code> parser here. To install this parser, execute the following command:</p>
<pre><code class="lang-bash">pip install lxml
</code></pre>
<p><strong>Note</strong>: You don't have to be a Python professional to follow this tutorial.</p>
<h2 id="heading-how-to-scrape-a-single-web-page">How to Scrape a Single Web Page</h2>
<p>As I explained earlier, we will start by understanding how to scrape a single web page. Then we'll move on to scraping multiple web pages. </p>
<p>Let's build our first scraper.</p>
<h3 id="heading-import-the-libraries">Import the libraries</h3>
<p>First, let's import the libraries we'll need:</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
</code></pre>
<h3 id="heading-get-the-website-html">Get the website HTML</h3>
<p>We want to scrape a website with hundreds of pages of movie transcripts. We'll begin by scraping a single page, and then demonstrate how to scrape multiple pages.  </p>
<p>First, we'll define the connection. In this example, we'll use the Titanic movie transcript, but you can select any movie you wish. </p>
<p>Then we send a request to the website with <code>requests.get()</code> and receive a response, which we store in the <code>result</code> variable. Following that, we'll use the <code>.text</code> attribute to retrieve the website's content as a string. </p>
<p>Finally, we'll use the <code>lxml</code> parser to get the <code>soup</code>, which is the object containing all of the data in the nested structure that we'll reuse later.</p>
<pre><code class="lang-py">website = <span class="hljs-string">'https://subslikescript.com/movie/Titanic-120338'</span>

result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, <span class="hljs-string">'lxml'</span>)

print(soup.prettify())
</code></pre>
<p>Once we have the <code>soup</code> object, we can get readable HTML by using <code>.prettify()</code>. We could paste the printed HTML into a text editor to find elements, but it is far easier to go straight to the HTML code of the element we want in the browser. We'll do this in the next step. </p>
<h3 id="heading-examine-the-webpage-and-html-code">Examine the webpage and HTML code</h3>
<p>Before we start writing code, we must first assess the website we want to scrape and the HTML code we got to identify the best strategy to scrape the website. A sample transcript is available below. The things to be scraped are the movie title and transcript.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/Screen-Shot-2023-02-10-at-16.45.28.png" alt="Image" width="600" height="400" loading="lazy">
<em>Image showing the title and transcript of the titanic movie.</em></p>
<p>To get the HTML code for a given element, perform the following steps:</p>
<ol>
<li>Navigate to the Titanic transcript's website.</li>
<li>Right-click on either the movie title or the transcript. You'll see a list. Select "Inspect" to view the page's source code.</li>
</ol>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/Screen-Shot-2023-02-10-at-17.00.05.png" alt="Image" width="600" height="400" loading="lazy">
<em>Image showing page source code</em></p>
<h3 id="heading-how-to-find-an-element-with-beautiful-soup">How to find an element with Beautiful Soup</h3>
<p>It's easy to find an element in Beautiful Soup. Simply apply the <code>.find()</code> method to the previously prepared soup.</p>
<p>As an example, find the box containing the movie title, description, and transcript. It's within an <code>article</code> tag and has the class <code>main-article</code> on it. We can use the following code to find that box:</p>
<pre><code class="lang-py">box = soup.find(<span class="hljs-string">'article'</span>, class_=<span class="hljs-string">'main-article'</span>)
</code></pre>
<p>The movie title is enclosed in an <code>h1</code> tag and lacks a class name. After we find it, we use the <code>.get_text()</code> function to retrieve the text within the node:</p>
<pre><code class="lang-py">title = box.find(<span class="hljs-string">'h1'</span>).get_text()
</code></pre>
<p>The transcript is included within a <code>div</code> tag and has the class <code>full-script</code>. In this scenario, we'll change the default arguments within the <code>.get_text()</code> function to get the text. </p>
<p>We begin by setting <code>strip=True</code> to remove leading and trailing whitespace from each piece of text. Then we set <code>separator=' '</code> so that the pieces, which were separated by new lines <code>\n</code> in the HTML, are joined with a single blank space.</p>
<pre><code class="lang-py">transcript = box.find(<span class="hljs-string">'div'</span>, class_=<span class="hljs-string">'full-script'</span>)
transcript = transcript.get_text(strip=<span class="hljs-literal">True</span>, separator=<span class="hljs-string">' '</span>)
</code></pre>
<p>So far, we've scraped the data successfully. Print the <code>title</code> and <code>transcript</code> variables to ensure that everything is operating properly.</p>
<h3 id="heading-how-to-export-data-into-a-txt-file">How to export data into a .txt file</h3>
<p>You can store data in <code>CSV</code>, <code>JSON</code>, and other formats. In this example, we'll save the extracted data in a <code>.txt</code> file. To accomplish this, we will use the <code>with</code> keyword, as shown in the code below:</p>
<pre><code class="lang-py"><span class="hljs-keyword">with</span> open(<span class="hljs-string">f'<span class="hljs-subst">{title}</span>.txt'</span>, <span class="hljs-string">'w'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:
    file.write(transcript)
</code></pre>
<p>Remember to use the <code>f</code>-string to set the file name as the movie title. After running the code, we should have a <code>.txt</code> file in our working directory.</p>
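<p>One thing to watch for: a movie title can contain characters that aren't allowed in file names on some systems (a slash or a colon, for example). Here's a minimal sketch of one way to guard against that. The <code>safe_filename</code> helper is hypothetical, not part of Beautiful Soup or the original script, and the character list is an assumption based on common file system rules.</p>

```python
import re


def safe_filename(title, extension=".txt"):
    """Replace characters that are invalid in common file systems with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title).strip(" ._")
    # Fall back to a placeholder name if nothing usable is left.
    return (cleaned or "untitled") + extension


plain = safe_filename("Titanic")
slashed = safe_filename("Face/Off")
empty = safe_filename("???")
print(plain, slashed, empty)
```

<p>You could then open the file with <code>open(safe_filename(title), 'w')</code> instead of building the name directly from the title.</p>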
<p>We're ready to scrape transcripts from multiple pages now that we've successfully scraped data from one web page!</p>
<h2 id="heading-how-to-scrape-multiple-web-pages">How to Scrape Multiple Web Pages</h2>
<p>On the transcript page, scroll down and click on the <a target="_blank" href="https://subslikescript.com/movies">all movie scripts</a> link. You can find it at the bottom of the web page.  </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/Screen-Shot-2023-02-11-at-07.59.55.png" alt="Image" width="600" height="400" loading="lazy">
<em>All transcripts page</em></p>
<p>The screenshot shows all of the movie transcripts. The website has 1,757 pages, with approximately 30 movie transcripts on each page. </p>
<p>In this section, we will scrape multiple links by obtaining the <code>href</code> attribute of each link. First, we must update the <code>website</code> variable so that it points to the page listing all the transcripts. Our new website variable will be as follows:</p>
<pre><code class="lang-py">root = <span class="hljs-string">'https://subslikescript.com'</span>
website = <span class="hljs-string">f'<span class="hljs-subst">{root}</span>/movies'</span>
</code></pre>
<p>The main reason why a <code>root</code> variable is defined in the code is to help scrape multiple web pages later.</p>
<h3 id="heading-how-to-get-the-href-attribute">How to get the href attribute</h3>
<p>Let's start with the <code>href</code> attribute of the 30 movies on one page. Examine any movie title within the "List of Movie Transcripts" box. </p>
<p>Following that, we should have the HTML code. An <code>a</code> tag should be highlighted in blue. Each <code>a</code> tag belongs to a movie title.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/Screen-Shot-2023-02-11-at-13.04.17.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>As we can see, the links within the <code>href</code> do not include the root domain subslikescript.com. This is why we defined the <code>root</code> variable earlier: we'll concatenate it with each link.</p>
<p>Let's look for all of the <code>a</code> elements on the page.</p>
<h3 id="heading-how-to-find-multiple-elements">How to find multiple elements</h3>
<p>In Beautiful Soup, we use the <code>.find_all()</code> method to locate multiple elements. To extract the link that corresponds to each movie transcript, we must include the parameter <code>href=True</code>.</p>
<pre><code class="lang-py">box.find_all(<span class="hljs-string">'a'</span>, href=<span class="hljs-literal">True</span>)
</code></pre>
<p>To get the links from the href, add <code>['href']</code> to the expression above. However, because the <code>.find_all()</code> method returns a list, we must loop through it and get the <code>hrefs</code> one by one within the loop.</p>
<pre><code class="lang-py"><span class="hljs-keyword">for</span> link <span class="hljs-keyword">in</span> box.find_all(<span class="hljs-string">'a'</span>, href=<span class="hljs-literal">True</span>):
    link[<span class="hljs-string">'href'</span>]
</code></pre>
<p>We can use list comprehension to save the links, as shown below:</p>
<pre><code class="lang-py">links = [link[<span class="hljs-string">'href'</span>] <span class="hljs-keyword">for</span> link <span class="hljs-keyword">in</span> box.find_all(<span class="hljs-string">'a'</span>, href=<span class="hljs-literal">True</span>)]
print(links)
</code></pre>
<p>The links we want to scrape will be visible if you print the links list. In the following step, we'll scrape each page.</p>
<h3 id="heading-how-to-loop-through-each-link">How to loop through each link</h3>
<p>To scrape the transcript of each link, we'll repeat the steps we used for the first transcript. This time, we'll put those steps inside the <code>for</code> loop below.</p>
<pre><code class="lang-py"><span class="hljs-keyword">for</span> link <span class="hljs-keyword">in</span> links:
    result = requests.get(<span class="hljs-string">f'<span class="hljs-subst">{root}</span>/<span class="hljs-subst">{link}</span>'</span>)
    content = result.text
    soup = BeautifulSoup(content, <span class="hljs-string">'lxml'</span>)
</code></pre>
<p>As you may recall, the links we previously saved did not include the root <code>subslikescript.com</code>, so we must concatenate it with the expression <code>f'{root}/{link}'</code>.</p>
<p>The rest of the code is identical to what we wrote in the first section of this guide.</p>
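<p>A small caveat on joining the root and the link: if an <code>href</code> happens to start with a slash, <code>f'{root}/{link}'</code> produces a double slash. I haven't verified which form every link on the site uses, so here is a hedged sketch using <code>urllib.parse.urljoin</code> from the standard library, which handles both cases. The <code>absolute_url</code> helper is a name made up for this example.</p>

```python
from urllib.parse import urljoin

root = "https://subslikescript.com"


def absolute_url(link):
    """Join the root and a relative link, with or without a leading slash."""
    return urljoin(root + "/", link)


without_slash = absolute_url("movie/Titanic-120338")
with_slash = absolute_url("/movie/Titanic-120338")
naive = f"{root}/{'/movie/Titanic-120338'}"  # plain concatenation, for comparison
print(without_slash)
print(with_slash)
print(naive)
```
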
<h2 id="heading-wrapping-up">Wrapping up</h2>
<p>If you want to move through all of the listing pages, you have two options.</p>
<ul>
<li>Inspect any of the page numbers that are visible on the webpage (for example, 1, 2, 3, or 1757). Get the <code>a</code> tag for each page and read its <code>href</code> attribute. When you have the links, combine them with the root and proceed as described in the second section.</li>
<li>Visit page 2 and copy the link you see there. This is how it ought to appear: <code>subslikescript.com/movies?page=2</code>. You can see that the website has a consistent format for each page: <code>f'{website}?page={i}'</code>. If you want to go through the first ten pages, you can reuse the website variable and loop between 1 and 10.</li>
</ul>
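<p>The second option can be sketched like this. It's a minimal example that assumes the <code>?page=</code> pattern described above; <code>page_urls</code> and <code>crawl</code> are hypothetical helpers, and the pause between requests is there to keep the scraping gentle. The <code>fetch</code> argument is whatever function downloads and parses one page.</p>

```python
import time

root = "https://subslikescript.com"
website = f"{root}/movies"


def page_urls(first=1, last=10):
    """Build the listing URLs for pages first..last (inclusive)."""
    return [f"{website}?page={i}" for i in range(first, last + 1)]


def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between requests."""
    results = []
    for position, url in enumerate(urls):
        if position:  # no need to wait before the very first request
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


first_three = page_urls(1, 3)
print(first_three)

# With requests installed, a real run might look like:
# pages = crawl(page_urls(1, 10), lambda url: requests.get(url).text)
```
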
<p>Let's connect on <a target="_blank" href="https://www.twitter.com/Shittu_Olumide_">Twitter</a> and on <a target="_blank" href="https://www.freecodecamp.org/news/p/596c046e-0ba5-4a99-bf4d-eb3e0bebe75c/linkedin.com/in/olumide-shittu">LinkedIn</a>. You can also subscribe to my <a target="_blank" href="https://www.youtube.com/channel/UCNhFxpk6hGt5uMCKXq0Jl8A">YouTube</a> channel.</p>
<p>Happy Coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Web Scraping in JavaScript – How to Use Puppeteer to Scrape Web Pages ]]>
                </title>
                <description>
                    <![CDATA[ Welcome to the world of web scraping! Have you ever needed data from a website but found it hard to access it in a structured format? This is where web scraping comes in. Using scripts, we can extract the data we need from a website for various purpo... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/web-scraping-in-javascript-with-puppeteer/</link>
                <guid isPermaLink="false">66bb921e0eaca026d8cfa5ed</guid>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ puppeteer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Gaël Thomas ]]>
                </dc:creator>
                <pubDate>Tue, 31 Jan 2023 15:26:55 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/01/web-scraping-in-javascript-with-puppeteer.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Welcome to the world of web scraping! Have you ever needed data from a website but found it hard to access it in a structured format? This is where web scraping comes in.</p>
<p>Using scripts, we can extract the data we need from a website for various purposes, such as creating databases, doing some analytics, and even more.</p>
<blockquote>
<p><strong>Disclaimer:</strong> Be careful when doing web scraping. Always make sure you're scraping sites that allow it, and performing this activity within ethical and legal limits.</p>
</blockquote>
<p>JavaScript and Node.js offer various libraries that make web scraping easier. For simple data extraction, you can use Axios to fetch an API response or a website's HTML. </p>
<p>But if you're looking to do more advanced tasks including automations, you'll need libraries such as <a target="_blank" href="https://pptr.dev/">Puppeteer</a>, <a target="_blank" href="https://cheerio.js.org/">Cheerio</a>, or <a target="_blank" href="https://github.com/segmentio/nightmare">Nightmare</a> (don't worry, the name is Nightmare, but it's not that bad to use 😆).</p>
<p>I'll introduce the basics of web scraping in JavaScript and Node.js using Puppeteer in this article. I structured the writing to show you some basics of fetching information on a website and clicking a button (for example, moving to the next page).</p>
<p>At the end of this introduction, I'll recommend ways to practice and learn more by improving the project we just created.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before diving in and scraping our first page together using JavaScript, Node.js, and the <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction">HTML DOM</a>, I'd recommend having a basic understanding of these technologies. It'll improve your learning and understanding of the topic.</p>
<p>Let's dive in! 🤿</p>
<h2 id="heading-how-to-initialize-your-first-puppeteer-scraper">How to Initialize Your First Puppeteer Scraper</h2>
<p>New project...new folder! First, create the <code>first-puppeteer-scraper-example</code> folder on your computer. It'll contain the code of our future scraper.</p>
<pre><code class="lang-shell">mkdir first-puppeteer-scraper-example
cd first-puppeteer-scraper-example
</code></pre>
<p>Now, it's time to initialize your Node.js repository with a <code>package.json</code> file. This file stores information about the repository and the NPM packages it depends on, such as the Puppeteer library.</p>
<pre><code class="lang-shell">npm init -y
</code></pre>
<p>After typing this command, you should find a <code>package.json</code> file similar to this one in your repository tree.</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"name"</span>: <span class="hljs-string">"first-puppeteer-scraper-example"</span>,
  <span class="hljs-attr">"version"</span>: <span class="hljs-string">"1.0.0"</span>,
  <span class="hljs-attr">"description"</span>: <span class="hljs-string">""</span>,
  <span class="hljs-attr">"main"</span>: <span class="hljs-string">"index.js"</span>,
  <span class="hljs-attr">"scripts"</span>: {
    <span class="hljs-attr">"test"</span>: <span class="hljs-string">"echo \"Error: no test specified\" &amp;&amp; exit 1"</span>
  },
  <span class="hljs-attr">"keywords"</span>: [],
  <span class="hljs-attr">"author"</span>: <span class="hljs-string">""</span>,
  <span class="hljs-attr">"license"</span>: <span class="hljs-string">"ISC"</span>
}
</code></pre>
<p>Before proceeding, we must ensure the project is configured to handle ES6 modules (the <code>import</code> syntax our script relies on). To do so, you can add the <code>"type": "module"</code> instruction at the end of the configuration. Note that the key is <code>type</code>, not <code>types</code>.</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"name"</span>: <span class="hljs-string">"first-puppeteer-scraper-example"</span>,
  <span class="hljs-attr">"version"</span>: <span class="hljs-string">"1.0.0"</span>,
  <span class="hljs-attr">"description"</span>: <span class="hljs-string">""</span>,
  <span class="hljs-attr">"main"</span>: <span class="hljs-string">"index.js"</span>,
  <span class="hljs-attr">"scripts"</span>: {
    <span class="hljs-attr">"test"</span>: <span class="hljs-string">"echo \"Error: no test specified\" &amp;&amp; exit 1"</span>
  },
  <span class="hljs-attr">"keywords"</span>: [],
  <span class="hljs-attr">"author"</span>: <span class="hljs-string">""</span>,
  <span class="hljs-attr">"license"</span>: <span class="hljs-string">"ISC"</span>,
  <span class="hljs-attr">"type"</span>: <span class="hljs-string">"module"</span>
}
</code></pre>
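<p>The last step of our scraper initialization is to install the Puppeteer library, which NPM also records under <code>dependencies</code> in <code>package.json</code> (<code>^19.6.2</code> at the time of writing). Here's how:</p>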
<pre><code class="lang-shell">npm install puppeteer
</code></pre>
<p>Wow! We're there – we're ready to scrape our first website together. 🤩</p>
<h2 id="heading-how-to-scrape-your-first-piece-of-data">How to Scrape Your First Piece of Data</h2>
<p>In this article, we'll use the <a target="_blank" href="https://toscrape.com/">ToScrape</a> website as our learning platform. This online sandbox provides two projects specifically designed for web scraping, making it a great starting point to learn the basics such as data extraction and page navigation.</p>
<p>For this beginner's introduction, we'll specifically focus on the <a target="_blank" href="http://quotes.toscrape.com/">Quotes to Scrape</a> website.</p>
<h3 id="heading-how-to-initialize-the-script">How to Initialize the Script</h3>
<p>In the project repository root, you can create an <code>index.js</code> file. This will be our application entry point.</p>
<p>To keep it simple, our script consists of one function in charge of getting the website's quotes (<code>getQuotes</code>).</p>
<p>In the function's body, we will need to follow different steps:</p>
<ul>
<li>Start a Puppeteer session with <code>puppeteer.launch</code> (it'll instantiate a <code>browser</code> variable that we'll use for manipulating the browser)</li>
<li>Open a new page/tab with <code>browser.newPage</code> (it'll instantiate a <code>page</code> variable that we'll use for manipulating the page)</li>
<li>Change the URL of our new page to <a target="_blank" href="http://quotes.toscrape.com/"><code>http://quotes.toscrape.com/</code></a> with <code>page.goto</code></li>
</ul>
<p>Here's the commented version of the initial script:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> puppeteer <span class="hljs-keyword">from</span> <span class="hljs-string">"puppeteer"</span>;

<span class="hljs-keyword">const</span> getQuotes = <span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-comment">// Start a Puppeteer session with:</span>
  <span class="hljs-comment">// - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)</span>
  <span class="hljs-comment">// - no default viewport (`defaultViewport: null` - website page will be in full width and height)</span>
  <span class="hljs-keyword">const</span> browser = <span class="hljs-keyword">await</span> puppeteer.launch({
    <span class="hljs-attr">headless</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">defaultViewport</span>: <span class="hljs-literal">null</span>,
  });

  <span class="hljs-comment">// Open a new page</span>
  <span class="hljs-keyword">const</span> page = <span class="hljs-keyword">await</span> browser.newPage();

  <span class="hljs-comment">// On this new page:</span>
  <span class="hljs-comment">// - open the "http://quotes.toscrape.com/" website</span>
  <span class="hljs-comment">// - wait until the dom content is loaded (HTML is ready)</span>
  <span class="hljs-keyword">await</span> page.goto(<span class="hljs-string">"http://quotes.toscrape.com/"</span>, {
    <span class="hljs-attr">waitUntil</span>: <span class="hljs-string">"domcontentloaded"</span>,
  });
};

<span class="hljs-comment">// Start the scraping</span>
getQuotes();
</code></pre>
<p>What do you think of running our scraper and seeing the output? Let's do it with the command below:</p>
<pre><code class="lang-shell">node index.js
</code></pre>
<p>After doing this, you should have a brand new browser application started with a new page and the website Quotes to Scrape loaded onto it. Magic, isn't it? 🪄</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-353.png" alt="Image" width="600" height="400" loading="lazy">
<em>Quotes to Scrape homepage loaded by our initial script</em></p>
<p><strong>Note:</strong> For this first iteration, we're not closing the browser. This means you will need to close the browser to stop the running application.</p>
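<p>Once the script does close the browser on its own, it can help to make sure that also happens when something fails halfway through. Here is a minimal sketch of that idea. The <code>withBrowser</code> helper and its <code>launchBrowser</code> and <code>task</code> parameters are hypothetical names, not part of Puppeteer. The only assumption is that the launched object exposes an async <code>close()</code> method, as a Puppeteer browser does.</p>

```javascript
// Hypothetical helper (not part of Puppeteer).
// `launchBrowser` is any async function returning a browser-like object
// with a `close()` method, e.g. () => puppeteer.launch({ headless: false }).
const withBrowser = async (launchBrowser, task) => {
  const browser = await launchBrowser();
  try {
    // Run the scraping work and hand its result back to the caller
    return await task(browser);
  } finally {
    // Runs whether `task` succeeded or threw, so the browser never lingers
    await browser.close();
  }
};
```

<p>With something like this, the scraping logic could be passed in as <code>task</code>, and you would no longer need to close the browser window by hand after an error.</p>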
<h3 id="heading-how-to-fetch-the-first-quote">How to Fetch the First Quote</h3>
<p>Whenever you want to scrape a website, you'll have to play with the HTML DOM. What I recommend is to inspect the page and start navigating the different elements to find what you need.</p>
<p>In our case, we'll follow the <a target="_blank" href="https://dictionary.cambridge.org/dictionary/english/baby-step">baby step principle</a> and start by fetching the first quote's text and author.</p>
<p>After browsing the page HTML, we can see that each quote is wrapped in a <code>&lt;div&gt;</code> element with a class name <code>quote</code> (<code>class="quote"</code>). This is important information because the scraping works with <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors">CSS selectors</a> (for example, <code>.quote</code>).</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-354.png" alt="Image" width="600" height="400" loading="lazy">
<em>Browser inspector with the first quote <code>&lt;div&gt;</code> selected</em></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-355.png" alt="Image" width="600" height="400" loading="lazy">
<em>An example of how each quote is rendered in the HTML</em></p>
<p>Now that we have this knowledge, we can return to our <code>getQuotes</code> function and improve our code to select the first quote and extract its data.</p>
<p>We will need to add the following after the <code>page.goto</code> instruction:</p>
<ul>
<li>Extract data from our page HTML with <code>page.evaluate</code> (it'll execute the function passed as a parameter in the page context and return the result)</li>
<li>Get the quote HTML node with <code>document.querySelector</code> (it'll fetch the first <code>&lt;div&gt;</code> with the classname <code>quote</code> and return it)</li>
<li>Get the quote text and author from the previously extracted quote HTML node with <code>quote.querySelector</code> (it'll extract the elements with the classname <code>text</code> and <code>author</code> under <code>&lt;div class="quote"&gt;</code> and return them)</li>
</ul>
<p>Here's the updated version with detailed comments:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> puppeteer <span class="hljs-keyword">from</span> <span class="hljs-string">"puppeteer"</span>;

<span class="hljs-keyword">const</span> getQuotes = <span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-comment">// Start a Puppeteer session with:</span>
  <span class="hljs-comment">// - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)</span>
  <span class="hljs-comment">// - no default viewport (`defaultViewport: null` - website page will be in full width and height)</span>
  <span class="hljs-keyword">const</span> browser = <span class="hljs-keyword">await</span> puppeteer.launch({
    <span class="hljs-attr">headless</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">defaultViewport</span>: <span class="hljs-literal">null</span>,
  });

  <span class="hljs-comment">// Open a new page</span>
  <span class="hljs-keyword">const</span> page = <span class="hljs-keyword">await</span> browser.newPage();

  <span class="hljs-comment">// On this new page:</span>
  <span class="hljs-comment">// - open the "http://quotes.toscrape.com/" website</span>
  <span class="hljs-comment">// - wait until the dom content is loaded (HTML is ready)</span>
  <span class="hljs-keyword">await</span> page.goto(<span class="hljs-string">"http://quotes.toscrape.com/"</span>, {
    <span class="hljs-attr">waitUntil</span>: <span class="hljs-string">"domcontentloaded"</span>,
  });

  <span class="hljs-comment">// Get page data</span>
  <span class="hljs-keyword">const</span> quotes = <span class="hljs-keyword">await</span> page.evaluate(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-comment">// Fetch the first element with class "quote"</span>
    <span class="hljs-keyword">const</span> quote = <span class="hljs-built_in">document</span>.querySelector(<span class="hljs-string">".quote"</span>);

    <span class="hljs-comment">// Fetch the sub-elements from the previously fetched quote element</span>
    <span class="hljs-comment">// Get the displayed text and return it (`.innerText`)</span>
    <span class="hljs-keyword">const</span> text = quote.querySelector(<span class="hljs-string">".text"</span>).innerText;
    <span class="hljs-keyword">const</span> author = quote.querySelector(<span class="hljs-string">".author"</span>).innerText;

    <span class="hljs-keyword">return</span> { text, author };
  });

  <span class="hljs-comment">// Display the quotes</span>
  <span class="hljs-built_in">console</span>.log(quotes);

  <span class="hljs-comment">// Close the browser</span>
  <span class="hljs-keyword">await</span> browser.close();
};

<span class="hljs-comment">// Start the scraping</span>
getQuotes();
</code></pre>
<p>Something interesting to point out is that the function name for selecting an element is the same as in the browser inspector. Here's an example:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-362.png" alt="Image" width="600" height="400" loading="lazy">
<em>After running the <code>document.querySelector</code> instruction in the browser inspector, we have the first quote as an output (like on Puppeteer)</em></p>
<p>Let's run our script one more time and see what we have as an output:</p>
<pre><code class="lang-json">{
  text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  author: 'Albert Einstein'
}
</code></pre>
<p>We did it! Our first scraped element is here, right in the terminal. Now, let's expand it and fetch all the current page quotes. 🔥</p>
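<p>One caveat with the extraction above: <code>querySelector</code> returns <code>null</code> when nothing matches, so reading <code>.innerText</code> on a missing element throws. A more defensive variant could look like the sketch below. The <code>readText</code> and <code>extractFirstQuote</code> helpers are hypothetical names, and the <code>root</code> parameter stands in for <code>document</code> so the logic can be tried outside a browser.</p>

```javascript
// Hypothetical helper: reads the text of a child element, or returns null
// when the selector matches nothing (querySelector returns null in that case).
const readText = (parent, selector) => {
  const element = parent ? parent.querySelector(selector) : null;
  return element ? element.innerText.trim() : null;
};

// `root` stands in for `document`; inside page.evaluate you would pass `document`.
const extractFirstQuote = (root) => {
  const quote = root.querySelector(".quote");
  if (!quote) return null; // no quote found on the page

  return {
    text: readText(quote, ".text"),
    author: readText(quote, ".author"),
  };
};
```

<p>Keep in mind that Puppeteer serializes the function passed to <code>page.evaluate</code> before running it in the page, so helpers like these would have to be declared inside that callback rather than at the top of <code>index.js</code>.</p>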
<h3 id="heading-how-to-fetch-all-current-page-quotes">How to Fetch All Current Page Quotes</h3>
<p>Now that we know how to fetch one quote, let's tweak our code a bit to get all the quotes and extract their data one by one.</p>
<p>Previously we used <code>document.querySelector</code> to select the first matching element (the first quote). To be able to fetch all quotes, we will need the <code>document.querySelectorAll</code> function instead.</p>
<p>We'll need to follow these steps to make it work:</p>
<ul>
<li>Replace <code>document.querySelector</code> with <code>document.querySelectorAll</code> (it'll fetch all <code>&lt;div&gt;</code> elements with the classname <code>quote</code> and return them as a <code>NodeList</code>)</li>
<li>Convert the fetched elements to an array with <code>Array.from(quoteList)</code> (a <code>NodeList</code> has no <code>.map</code> method, so this gives us one)</li>
<li>Move our previous code to get the quote text and author inside the loop and return the result (it'll extract the elements with the classname <code>text</code> and <code>author</code> under <code>&lt;div class="quote"&gt;</code> for each quote)</li>
</ul>
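<p>As a standalone illustration of the conversion step, here is a small sketch that runs in plain Node.js. It assumes an array-like object with plain objects in place of real DOM elements, since a <code>NodeList</code> only exists in the browser.</p>

```javascript
// An array-like stand-in for the NodeList returned by querySelectorAll
// (assumption: plain objects replace real DOM elements so this runs in Node.js)
const quoteList = {
  length: 2,
  0: { text: "First quote", author: "Author A" },
  1: { text: "Second quote", author: "Author B" },
};

// Array-likes have no .map, so convert first, then transform each item
const authors = Array.from(quoteList).map((quote) => quote.author);

// Array.from also accepts the mapping function as its second argument
const texts = Array.from(quoteList, (quote) => quote.text);

console.log(authors, texts);
```

<p>The scraper applies the same conversion to the real <code>quoteList</code> inside <code>page.evaluate</code>.</p>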
<p>Here's the code update:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> puppeteer <span class="hljs-keyword">from</span> <span class="hljs-string">"puppeteer"</span>;

<span class="hljs-keyword">const</span> getQuotes = <span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-comment">// Start a Puppeteer session with:</span>
  <span class="hljs-comment">// - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)</span>
  <span class="hljs-comment">// - no default viewport (`defaultViewport: null` - website page will be in full width and height)</span>
  <span class="hljs-keyword">const</span> browser = <span class="hljs-keyword">await</span> puppeteer.launch({
    <span class="hljs-attr">headless</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">defaultViewport</span>: <span class="hljs-literal">null</span>,
  });

  <span class="hljs-comment">// Open a new page</span>
  <span class="hljs-keyword">const</span> page = <span class="hljs-keyword">await</span> browser.newPage();

  <span class="hljs-comment">// On this new page:</span>
  <span class="hljs-comment">// - open the "http://quotes.toscrape.com/" website</span>
  <span class="hljs-comment">// - wait until the dom content is loaded (HTML is ready)</span>
  <span class="hljs-keyword">await</span> page.goto(<span class="hljs-string">"http://quotes.toscrape.com/"</span>, {
    <span class="hljs-attr">waitUntil</span>: <span class="hljs-string">"domcontentloaded"</span>,
  });

  <span class="hljs-comment">// Get page data</span>
  <span class="hljs-keyword">const</span> quotes = <span class="hljs-keyword">await</span> page.evaluate(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-comment">// Fetch all the elements with class "quote"</span>
    <span class="hljs-keyword">const</span> quoteList = <span class="hljs-built_in">document</span>.querySelectorAll(<span class="hljs-string">".quote"</span>);

    <span class="hljs-comment">// Convert the quoteList (a NodeList) to an array so we can use `.map`</span>
    <span class="hljs-comment">// For each quote fetch the text and author</span>
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">Array</span>.from(quoteList).map(<span class="hljs-function">(<span class="hljs-params">quote</span>) =&gt;</span> {
      <span class="hljs-comment">// Fetch the sub-elements from the previously fetched quote element</span>
      <span class="hljs-comment">// Get the displayed text and return it (`.innerText`)</span>
      <span class="hljs-keyword">const</span> text = quote.querySelector(<span class="hljs-string">".text"</span>).innerText;
      <span class="hljs-keyword">const</span> author = quote.querySelector(<span class="hljs-string">".author"</span>).innerText;

      <span class="hljs-keyword">return</span> { text, author };
    });
  });

  <span class="hljs-comment">// Display the quotes</span>
  <span class="hljs-built_in">console</span>.log(quotes);

  <span class="hljs-comment">// Close the browser</span>
  <span class="hljs-keyword">await</span> browser.close();
};

<span class="hljs-comment">// Start the scraping</span>
getQuotes();
</code></pre>
<p>As an end result, if we run our script one more time, we should have a list of quotes as an output. Each element of this list should have a text and an author property.</p>
<pre><code class="lang-json">[
  {
    text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
    author: 'Albert Einstein'
  },
  {
    text: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    author: 'J.K. Rowling'
  },
  {
    text: '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
    author: 'Albert Einstein'
  },
  {
    text: '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
    author: 'Jane Austen'
  },
  {
    text: <span class="hljs-string">"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"</span>,
    author: 'Marilyn Monroe'
  },
  {
    text: '“Try not to become a man of success. Rather become a man of value.”',
    author: 'Albert Einstein'
  },
  {
    text: '“It is better to be hated for what you are than to be loved for what you are not.”',
    author: 'André Gide'
  },
  {
    text: <span class="hljs-string">"“I have not failed. I've just found 10,000 ways that won't work.”"</span>,
    author: 'Thomas A. Edison'
  },
  {
    text: <span class="hljs-string">"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"</span>,
    author: 'Eleanor Roosevelt'
  },
  {
    text: '“A day without sunshine is like, you know, night.”',
    author: 'Steve Martin'
  }
]
</code></pre>
<p>Good job! All the quotes from the first page are now scraped by our script. 👏</p>
<h3 id="heading-how-to-move-to-the-next-page">How to Move to the Next Page</h3>
<p>Our script is now able to fetch all the quotes for one page. What would be interesting is clicking the "Next" button at the bottom of the page and doing the same on the second page.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-363.png" alt="Image" width="600" height="400" loading="lazy">
<em>"Next" button at the Quotes to Scrape page bottom</em></p>
<p>Let's go back to the browser inspector and find out how we can target this element using CSS selectors. </p>
<p>As we can notice, the next button is placed under an unordered list <code>&lt;ul&gt;</code> with a <code>pager</code> classname (<code>&lt;ul class="pager"&gt;</code>). This list has an element <code>&lt;li&gt;</code> with a <code>next</code> classname (<code>&lt;li class="next"&gt;</code>). Finally, there is a link anchor <code>&lt;a&gt;</code> that links to the second page (<code>&lt;a href="/page/2/"&gt;</code>).</p>
<p>In CSS, if we want to target this specific link there are different ways to do that. We can do:</p>
<ul>
<li><code>.next &gt; a</code>: this works, but it's risky because if another element with the <code>.next</code> class contains a link earlier in the page, the script will click on that one instead.</li>
<li><code>.pager &gt; .next &gt; a</code>: safer, because we make sure the link should be inside the <code>.pager</code> parent element under the <code>.next</code> element. There is a low risk of having this hierarchy more than once.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-356.png" alt="Image" width="600" height="400" loading="lazy">
<em>An example of how the "Next" button is rendered in the HTML</em></p>
<p>To click this button, at the end of our script after the <code>console.log(quotes);</code>, you can add the following: <code>await page.click(".pager &gt; .next &gt; a");</code>.</p>
<p>Since we're now closing the browser with <code>await browser.close();</code> after all instructions are done, you need to comment out this instruction to see the second page opened in the scraper browser.</p>
<p>It's temporary and for testing purposes, but the end of our <code>getQuotes</code> function should look like this:</p>
<pre><code class="lang-javascript">  <span class="hljs-comment">// Display the quotes</span>
  <span class="hljs-built_in">console</span>.log(quotes);

  <span class="hljs-comment">// Click on the "Next page" button</span>
  <span class="hljs-keyword">await</span> page.click(<span class="hljs-string">".pager &gt; .next &gt; a"</span>);

  <span class="hljs-comment">// Close the browser</span>
  <span class="hljs-comment">// await browser.close();</span>
</code></pre>
<p>After this, if you run our scraper again, after processing all instructions, your browser should stop on the second page:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-357.png" alt="Image" width="600" height="400" loading="lazy">
<em>Quotes to Scrape second page loaded after clicking the "Next" button</em></p>
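<p>The single click above can be turned into a loop that keeps going until there is no "Next" link left. Below is a rough sketch of one way to do it, not the only one. The <code>scrapeAllPages</code> function, its <code>extractQuotes</code> parameter, and the <code>maxPages</code> safety limit are hypothetical names introduced for illustration. <code>page.$</code>, <code>page.click</code>, and <code>page.waitForNavigation</code> are Puppeteer page methods.</p>

```javascript
// Sketch: `page` is a Puppeteer Page; `extractQuotes` is a hypothetical
// function that takes the page and resolves to an array of quotes.
const NEXT_SELECTOR = ".pager > .next > a";

const scrapeAllPages = async (page, extractQuotes, maxPages = 50) => {
  const allQuotes = [];

  for (let pageNumber = 1; pageNumber <= maxPages; pageNumber += 1) {
    // Collect the quotes on the current page
    allQuotes.push(...(await extractQuotes(page)));

    // page.$ resolves to null when no element matches: last page reached
    const nextLink = await page.$(NEXT_SELECTOR);
    if (!nextLink) break;

    // Start waiting for the navigation before clicking to avoid a race
    await Promise.all([
      page.waitForNavigation({ waitUntil: "domcontentloaded" }),
      page.click(NEXT_SELECTOR),
    ]);
  }

  return allQuotes;
};
```

<p>In the scraper, <code>extractQuotes</code> could simply wrap the existing <code>page.evaluate</code> call that returns the list of quotes. The <code>maxPages</code> limit is only there to stop the loop if a site keeps showing a "Next" link forever.</p>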
<h2 id="heading-its-your-time-heres-what-you-can-do-next">It’s Your Time! Here’s What You Can Do Next:</h2>
<p>Congrats on reaching the end of this introduction to scraping with Puppeteer! 👏</p>
<p>Now it's your turn to improve the scraper and make it get more data from the Quotes to Scrape website. Here's a list of potential improvements you can make:</p>
<ul>
<li>Navigate between all pages using the "Next" button and fetch the quotes on all the pages.</li>
<li>Fetch the quote's tags (each quote has a list of tags).</li>
<li>Scrape the author's about page (by clicking on the author's name on each quote).</li>
<li>Categorize the quotes by tags or authors (it's not 100% related to the scraping itself, but that can be a good improvement).</li>
</ul>
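<p>To give an idea of the last improvement, here is a minimal sketch of grouping quotes by author. It assumes the array of <code>{ text, author }</code> objects our scraper already prints, and <code>groupByAuthor</code> is a hypothetical helper name.</p>

```javascript
// Hypothetical helper: groups scraped quotes by their `author` property.
// Input:  [{ text, author }, ...] as returned by the scraper
// Output: { "Author name": ["quote text", ...], ... }
const groupByAuthor = (quotes) =>
  quotes.reduce((groups, { text, author }) => {
    // Create the author's list the first time we meet them
    if (!groups[author]) groups[author] = [];
    groups[author].push(text);
    return groups;
  }, {});
```

<p>Grouping by tag would follow the same pattern once each quote also carries a list of tags.</p>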
<p>Feel free to be creative and do any other things you see fit 🚀</p>
<h3 id="heading-scraper-code-is-available-on-github">Scraper Code Is Available on GitHub</h3>
<p>Check out the latest version of our scraper on GitHub! You're free to save, fork, or utilize it as you see fit.</p>
<p>=&gt; <a target="_blank" href="https://github.com/gaelgthomas/first-puppeteer-scraper-example">First Puppeteer Scraper (example)</a></p>
<h2 id="heading-successful-scraping-start-thanks-for-reading-the-article">Successful Scraping Start: Thanks for reading the article!</h2>
<p>I hope this article gave you a valuable introduction to web scraping using JavaScript and Puppeteer. Writing this was a pleasure, and I hope you found it informative and enjoyable.</p>
<p><a target="_blank" href="https://twitter.com/gaelgthomas">Join me on Twitter</a> for more content like this. I regularly share content to help you grow your web development skills and would love to have you join the conversation. Let's learn, grow, and inspire each other along the way!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Python to Scrape App Store Reviews ]]>
                </title>
                <description>
                    <![CDATA[ By Shittu Olumide Data scraping, commonly referred to as web scraping, is a technique for getting data and content from the internet. You usually keep this information in a local file so that you can change and inspect it as needed.  Web scraping is ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-python-to-scrape-app-store-reviews/</link>
                <guid isPermaLink="false">66d4610f38f2dc3808b790f9</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 16 Sep 2022 18:04:19 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/09/Shittu-Olumide-How-to-use-Python-to-scrape-App-Store-reviews-Freecodecamp.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shittu Olumide</p>
<p>Data scraping, commonly referred to as web scraping, is a technique for getting data and content from the internet.</p>
<p>You usually keep this information in a local file so that you can change and inspect it as needed. </p>
<p>On a very small scale, web scraping is basically just copying and pasting content from a website into an Excel spreadsheet.</p>
<p>The main goal of this article is to help you get started in web scraping using quick and easy steps. You will learn how to scrape app store reviews using the <code>app_store_scraper</code> library in Python. There are other tools and libraries you can use such as <code>Scrapy</code>, <code>Pandas</code>, and <code>BeautifulSoup</code>, but here we will use <code>app_store_scraper</code>. </p>
<p>Depending on the mechanism you select for web scraping, it might be either really simple or quite complex. </p>
<p>Fortunately, there is straightforward and excellent software that can help you gather reviews about your app from the Apple app store and use them for further sentiment analysis.</p>
<h3 id="heading-why-is-web-scraping-even-useful">Why is web scraping even useful?</h3>
<p>Data analytics professionals employ web scraping for a variety of tasks, including lead creation, market analysis, consumer sentiment analysis, and data integration.</p>
<p>You can also use web scraping to track stock prices, online opportunities (such as scholarships, employment, internships, and so on), competitors' inventory data, and customer reviews and ratings.  </p>
<p>In this article, you will learn how to use Python to scrape app store reviews in 4 easy steps. </p>
<p>Before you start, here's something to keep in mind: some sites don't allow you to scrape their content, so be sure you check before doing so. Web scraping isn't precisely forbidden, but you should take care to know when/where you can scrape. I strongly recommend that you scrape for informational and educational purposes only.</p>
<h2 id="heading-step-1-install-and-setup-packages">Step 1 – Install and Set Up Packages</h2>
<p>First, you have to install and set up the necessary packages. In this step you will install <code>app_store_scraper</code> using the Python package installer.</p>
<pre><code class="lang-py">pip install app_store_scraper 

<span class="hljs-comment">#or</span>

pip3 install app_store_scraper
</code></pre>
<h2 id="heading-step-2-get-apps-name-and-id">Step 2 – Get App's Name and ID</h2>
<p>I will be using a random app and I will be scraping its reviews for the sake of this demo. But if you have a personal app that you built and you have it on the App Store, you can use that app with these same techniques. You just need to get the app's name and ID, which you can find by typing the name of the app into Google using your PC. </p>
<p>Example: "<em>Slack app on apple app store</em>"</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/slack-google-search.PNG" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You should click on the first result which will redirect you to the official Apple store. There you will find the "Slack app" and everything about it. </p>
<p>Once the page loads, you will see the app name (Slack) and app ID (618783545) in the URL. Copy them down in your notepad.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/slack-app-name-app-id.PNG" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now you'll need to import some packages and run some code:</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> json

<span class="hljs-keyword">from</span> app_store_scraper <span class="hljs-keyword">import</span> AppStore
slack = AppStore(country=<span class="hljs-string">'us'</span>, app_name=<span class="hljs-string">'slack'</span>, app_id = <span class="hljs-string">'618783545'</span>)

slack.review(how_many=<span class="hljs-number">2000</span>)
</code></pre>
<p>In the code above, you will import the <code>pandas</code> library which helps you add evaluations/reviews to a dataframe. You'll also import the <code>numpy</code> library for data transformation and modification. Finally, you'll get the <code>app_store_scraper</code> package itself for scraping the reviews from the website. </p>
<p>You will have to create an instance of the <code>AppStore</code> class, then pass in the arguments <code>country</code>, <code>app_name</code>, and <code>app_id</code>. Calling <code>review()</code> on that instance fetches the reviews.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/slack-web-scraping.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>slack app ratings</em></p>
<p>The reviews are all stored in the <code>slack</code> variable, so run the command below to see the reviews stored in JSON format.</p>
<pre><code class="lang-py">slack.reviews
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/slack-reviews.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>slack app scraped reviews</em></p>
<h2 id="heading-step-3-convert-data-from-json">Step 3 – Convert Data from JSON</h2>
<p>To make data more readable and properly formatted, you need to convert it from JSON format to a Pandas dataframe. You can do that with the following code:</p>
<pre><code class="lang-py">slackdf = pd.DataFrame(np.array(slack.reviews),columns=[<span class="hljs-string">'review'</span>])
slackdf2 = slackdf.join(pd.DataFrame(slackdf.pop(<span class="hljs-string">'review'</span>).tolist()))
slackdf2.head()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/09/slack-generated-reviews.PNG" alt="Image" width="600" height="400" loading="lazy">
<em>generated reviews in pandas dataframe</em></p>
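<p>Because the scraped reviews are a list of dictionaries, pandas can also build the table from that list directly. The sketch below uses two made-up reviews in place of <code>slack.reviews</code>. The field names are assumptions for illustration and may not match what the scraper returns.</p>
<pre><code class="lang-python">import pandas as pd

# Made-up sample standing in for slack.reviews (a list of dictionaries).
# The field names here are assumptions for illustration only.
sample_reviews = [
    {"userName": "user_a", "rating": 5, "title": "Great", "review": "Works well."},
    {"userName": "user_b", "rating": 2, "title": "Buggy", "review": "Crashes often."},
]

# pandas turns each dictionary into a row and each key into a column.
reviews_df = pd.DataFrame(sample_reviews)
print(reviews_df.shape)
print(list(reviews_df.columns))
</code></pre>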
<h2 id="heading-step-4-convert-the-dataframe-to-csv">Step 4 – Convert the Dataframe to CSV</h2>
<p>Here is the final step: you will convert the dataframe into CSV (comma-separated values) format so that you can have it on your local machine. Then you can view it in a spreadsheet and also share it with a colleague.</p>
<pre><code class="lang-py">slackdf2.to_csv(<span class="hljs-string">'Slack-app-reviews.csv'</span>)
</code></pre>
<p>Finally, you should have your "Slack-app-reviews.csv" file saved into your project folder and you're ready to go. </p>
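<p>By default, <code>to_csv</code> also writes the row index as an extra first column. If you do not want that column in your spreadsheet, you can pass <code>index=False</code>. Here is a minimal sketch with a toy dataframe standing in for <code>slackdf2</code>. The file name is hypothetical and the file goes to a temporary folder.</p>
<pre><code class="lang-python">import os
import tempfile

import pandas as pd

# Toy dataframe standing in for slackdf2.
toy_df = pd.DataFrame({"rating": [5, 2], "review": ["Works well.", "Crashes often."]})

# Write to a temporary folder so the sketch does not touch your project files.
csv_path = os.path.join(tempfile.mkdtemp(), "toy-reviews.csv")
toy_df.to_csv(csv_path, index=False)  # index=False leaves out the row numbers

# Read the file back to confirm that the columns survived the round trip.
round_trip = pd.read_csv(csv_path)
print(list(round_trip.columns))
</code></pre>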
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this short article, you scraped Slack's App Store reviews into a dataframe and then saved it to your local machine in four easy steps. I hope you enjoyed it, cheers.</p>
<p>Here is the <a target="_blank" href="https://github.com/zenUnicorn/App-rating-scraper-with-python">GitHub repo</a> where I hosted the code, feel free to star the repository.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Web Scraping in Python – How to Scrape Sci-Fi Movies from IMDB ]]>
                </title>
                <description>
                    <![CDATA[ By Riley Predum Have you ever struggled to find a dataset for your data science project? If you're like I am, the answer is yes.  Luckily, there are many free datasets available – but sometimes you want something more specific or bespoke. For that, w... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/web-scraping-sci-fi-movies-from-imdb-with-python/</link>
                <guid isPermaLink="false">66d460c5677cb8c6c15f3177</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 09 Aug 2022 19:50:10 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/07/pexels-pixabay-270348--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Riley Predum</p>
<p>Have you ever struggled to find a dataset for your data science project? If you're like I am, the answer is yes. </p>
<p>Luckily, there are many free datasets available – but sometimes you want something more specific or bespoke. For that, web scraping is a good skill to have in your toolbox to pull data off your favorite website.</p>
<h2 id="heading-whats-covered-in-this-article">What’s Covered in this Article?</h2>
<p>This article has a Python script you can use to scrape the data on sci-fi movies (or whatever genre you choose!) from the <a target="_blank" href="https://www.imdb.com/">IMDB</a> website. It can then write these data to a dataframe for further exploration. </p>
<p>I will conclude this article with a bit of exploratory data analysis (EDA). Through this, you will see what further data science projects are possible for you to try.</p>
<p><em>Disclaimer: while web scraping is a great way of programmatically pulling data off of websites, please do so responsibly. My script uses the sleep function, for example, to slow down the requests intentionally, so as not to overload IMDB's servers. Some websites frown upon the use of web scrapers, so use it wisely.</em></p>
<h2 id="heading-web-scraping-and-data-cleaning-script">Web Scraping and Data Cleaning Script</h2>
<p>Let’s get to the scraping script and get that running. The script pulls in movie titles, years, ratings (PG-13, R, and so on), genres, runtimes, reviews, and votes for each movie. You can choose how many pages you want to scrape based on your data needs. </p>
<p><em>Note: the more pages you select, the longer it will take. It takes about 40 minutes to scrape 200 webpages using the <a target="_blank" href="https://colab.research.google.com/drive/11avx1TqYw_2sb5tUNi0ZO4ABRc50LNxK?usp=sharing">Google Colab Notebook</a>.</em></p>
<p>For those of you who have not tried it before, Google Colab is a cloud-based <a target="_blank" href="https://realpython.com/jupyter-notebook-introduction/#:~:text=The%20Jupyter%20Notebook%20is%20an,the%20people%20at%20Project%20Jupyter.">Jupyter Notebook</a> style Python development tool that lives in the Google app suite. You can use it out of the box with many of the packages already installed that are common in data science. </p>
<p>Below is an image of the Colab workspace and its layout:</p>
<p><img src="https://lh4.googleusercontent.com/R9sAuHzGHrEvRK_hiAWsy4W41W72et6clD38gIYeAA6AtA32e97xxw0W5ub_96xmgSMTDB2VjRK-gz_YgYtZoV1YyCHjKftaB7-HD2NQ7qt_8hcdnDfqaibp0ONwPr9-4zO5gv3FuXdxiOMsN6eF8bA" alt="Image" width="600" height="400" loading="lazy">
<em>Introducing the Google Colab user interface</em></p>
<p>With that, let’s dive in! First things first, you should always import your packages in their own cell. If you forget a package, you can re-run just that cell. This cuts down on development time. </p>
<p>Note: some of these packages need <code>pip install package_name</code> to be run to install them first. If you choose to run the code locally using something like a Jupyter Notebook you'll need to do that. If you want to get up and running quickly, you can use the Google Colab notebook. This has all these installed by default.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> requests <span class="hljs-keyword">import</span> get
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">from</span> warnings <span class="hljs-keyword">import</span> warn
<span class="hljs-keyword">from</span> time <span class="hljs-keyword">import</span> sleep
<span class="hljs-keyword">from</span> random <span class="hljs-keyword">import</span> randint
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np, pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
</code></pre>
<h2 id="heading-how-to-do-the-web-scraping">How to Do the Web Scraping</h2>
<p>You can run the following code which does the actual web scraping. It will pull all the columns mentioned above into arrays and populate them one movie at a time, one page at a time. </p>
<p>There are also some data cleaning steps I have added and documented in this code as well. I removed parentheses from string data mentioning the year of the film for example. I then converted those to integers. Things like this make exploratory data analysis and modeling easier.</p>
<p>Note that I use the sleep function to avoid being restricted by IMDB when it comes to cycling through their web pages too quickly.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Note this takes about 40 min to run if np.arange is set to 9951 as the stopping point.</span>

pages = np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">9951</span>, <span class="hljs-number">50</span>) <span class="hljs-comment"># Each value is the index of the first result on a page (50 results per page). Last time I tried, I could only go to 10000 items because after that the URI has no discernible pattern (to combat web crawlers). For a quick demonstration, lower the stop value (for example, 201 scrapes 4 pages). You can increase it for your own projects.</span>
headers = {<span class="hljs-string">'Accept-Language'</span>: <span class="hljs-string">'en-US,en;q=0.8'</span>} <span class="hljs-comment"># If this is not specified, the default language is Mandarin</span>

<span class="hljs-comment">#initialize empty lists to store the variables scraped</span>
titles = []
years = []
ratings = []
genres = []
runtimes = []
imdb_ratings = []
imdb_ratings_standardized = []
metascores = []
votes = []

<span class="hljs-keyword">for</span> page <span class="hljs-keyword">in</span> pages:

   <span class="hljs-comment">#get request for sci-fi</span>
   response = get(<span class="hljs-string">"https://www.imdb.com/search/title?genres=sci-fi&amp;"</span>
                  + <span class="hljs-string">"start="</span>
                  + str(page)
                  + <span class="hljs-string">"&amp;explore=title_type,genres&amp;ref_=adv_prv"</span>, headers=headers)

   sleep(randint(<span class="hljs-number">8</span>,<span class="hljs-number">15</span>))

   <span class="hljs-comment">#throw warning for status codes that are not 200</span>
   <span class="hljs-keyword">if</span> response.status_code != <span class="hljs-number">200</span>:
       warn(<span class="hljs-string">'Request: {}; Status code: {}'</span>.format(page, response.status_code))

   <span class="hljs-comment">#parse the content of current iteration of request</span>
   page_html = BeautifulSoup(response.text, <span class="hljs-string">'html.parser'</span>)

   movie_containers = page_html.find_all(<span class="hljs-string">'div'</span>, class_ = <span class="hljs-string">'lister-item mode-advanced'</span>)

   <span class="hljs-comment">#extract the 50 movies for that page</span>
   <span class="hljs-keyword">for</span> container <span class="hljs-keyword">in</span> movie_containers:

       <span class="hljs-comment">#conditional for all with metascore</span>
       <span class="hljs-keyword">if</span> container.find(<span class="hljs-string">'div'</span>, class_ = <span class="hljs-string">'ratings-metascore'</span>) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:

           <span class="hljs-comment">#title</span>
           title = container.h3.a.text
           titles.append(title)

           <span class="hljs-keyword">if</span> container.h3.find(<span class="hljs-string">'span'</span>, class_= <span class="hljs-string">'lister-item-year text-muted unbold'</span>) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:

             <span class="hljs-comment">#year released</span>
             year = container.h3.find(<span class="hljs-string">'span'</span>, class_= <span class="hljs-string">'lister-item-year text-muted unbold'</span>).text <span class="hljs-comment"># keep the raw text here; the parentheses are removed and the year is made numeric after scraping</span>
             years.append(year)

           <span class="hljs-keyword">else</span>:
             years.append(<span class="hljs-literal">None</span>) <span class="hljs-comment"># each of the additional else clauses handles missing data by appending a placeholder so the lists are all the same length at the end of the scraping</span>

           <span class="hljs-keyword">if</span> container.p.find(<span class="hljs-string">'span'</span>, class_ = <span class="hljs-string">'certificate'</span>) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:

             <span class="hljs-comment">#rating</span>
             rating = container.p.find(<span class="hljs-string">'span'</span>, class_= <span class="hljs-string">'certificate'</span>).text
             ratings.append(rating)

           <span class="hljs-keyword">else</span>:
             ratings.append(<span class="hljs-string">""</span>)

           <span class="hljs-keyword">if</span> container.p.find(<span class="hljs-string">'span'</span>, class_ = <span class="hljs-string">'genre'</span>) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:

             <span class="hljs-comment">#genre</span>
             genre = container.p.find(<span class="hljs-string">'span'</span>, class_ = <span class="hljs-string">'genre'</span>).text.replace(<span class="hljs-string">"\n"</span>, <span class="hljs-string">""</span>).rstrip().split(<span class="hljs-string">','</span>) <span class="hljs-comment"># remove the whitespace character, strip, and split to create an array of genres</span>
             genres.append(genre)

           <span class="hljs-keyword">else</span>:
             genres.append(<span class="hljs-string">""</span>)

           <span class="hljs-keyword">if</span> container.p.find(<span class="hljs-string">'span'</span>, class_ = <span class="hljs-string">'runtime'</span>) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:

             <span class="hljs-comment">#runtime</span>
             time = int(container.p.find(<span class="hljs-string">'span'</span>, class_ = <span class="hljs-string">'runtime'</span>).text.replace(<span class="hljs-string">" min"</span>, <span class="hljs-string">""</span>)) <span class="hljs-comment"># remove the minute word from the runtime and make it an integer</span>
             runtimes.append(time)

           <span class="hljs-keyword">else</span>:
             runtimes.append(<span class="hljs-literal">None</span>)

           <span class="hljs-keyword">if</span> container.strong <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:

             <span class="hljs-comment">#IMDB ratings</span>
             imdb = float(container.strong.text) <span class="hljs-comment"># non-standardized variable</span>
             imdb_ratings.append(imdb)

           <span class="hljs-keyword">else</span>:
             imdb_ratings.append(<span class="hljs-literal">None</span>)

           <span class="hljs-keyword">if</span> container.find(<span class="hljs-string">'span'</span>, class_ = <span class="hljs-string">'metascore'</span>) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:

             <span class="hljs-comment">#Metascore</span>
             m_score = int(container.find(<span class="hljs-string">'span'</span>, class_ = <span class="hljs-string">'metascore'</span>).text) <span class="hljs-comment"># make it an integer</span>
             metascores.append(m_score)

           <span class="hljs-keyword">else</span>:
             metascores.append(<span class="hljs-literal">None</span>)

           <span class="hljs-keyword">if</span> container.find(<span class="hljs-string">'span'</span>, attrs = {<span class="hljs-string">'name'</span>:<span class="hljs-string">'nv'</span>}) <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:

             <span class="hljs-comment">#Number of votes</span>
             vote = int(container.find(<span class="hljs-string">'span'</span>, attrs = {<span class="hljs-string">'name'</span>:<span class="hljs-string">'nv'</span>})[<span class="hljs-string">'data-value'</span>])
             votes.append(vote)

           <span class="hljs-keyword">else</span>:
             votes.append(<span class="hljs-literal">None</span>)
</code></pre>
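<p>To see how the <code>start</code> parameter drives the pagination, the sketch below builds the list of search URLs without sending any requests. It uses only the standard library. The function name <code>build_search_url</code> is hypothetical, and the query parameters are copied from the request in the script above.</p>
<pre><code class="lang-python">from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title"

def build_search_url(start, genre="sci-fi"):
    # Hypothetical helper: assemble one search-results URL for a given offset.
    # urlencode percent-encodes the comma in the explore value.
    query = urlencode({
        "genres": genre,
        "start": start,
        "explore": "title_type,genres",
        "ref_": "adv_prv",
    })
    return BASE_URL + "?" + query

# Same offsets as np.arange(1, 9951, 50): 1, 51, 101, and so on, one per page.
offsets = range(1, 9951, 50)
urls = [build_search_url(start) for start in offsets]

print(len(urls))
print(urls[0])
</code></pre>
<p>With these settings the list holds 199 URLs, one for each page of 50 results. Changing the stop value of the range is all it takes to scrape fewer pages.</p>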
<p>A Pandas dataframe can be built from a dictionary that maps each column name to the list of values for that column. I also added a couple of extra steps here to finish the data cleaning. </p>
<p>After you run the following cell, you should have a dataframe with the data you scraped.</p>
<pre><code class="lang-python">sci_fi_df = pd.DataFrame({<span class="hljs-string">'movie'</span>: titles,
                      <span class="hljs-string">'year'</span>: years,
                      <span class="hljs-string">'rating'</span>: ratings,
                      <span class="hljs-string">'genre'</span>: genres,
                      <span class="hljs-string">'runtime_min'</span>: runtimes,
                      <span class="hljs-string">'imdb'</span>: imdb_ratings,
                      <span class="hljs-string">'metascore'</span>: metascores,
                      <span class="hljs-string">'votes'</span>: votes}
                      )

<span class="hljs-comment"># Two more data transformations after scraping:</span>
<span class="hljs-comment"># 1. Keep only the four characters before the closing parenthesis, so '(2019)' becomes '2019'</span>
sci_fi_df[<span class="hljs-string">'year'</span>] = sci_fi_df[<span class="hljs-string">'year'</span>].str[<span class="hljs-number">-5</span>:<span class="hljs-number">-1</span>]
<span class="hljs-comment"># 2. Put the IMDB rating on the same 0-100 scale as the metascore</span>
sci_fi_df[<span class="hljs-string">'n_imdb'</span>] = sci_fi_df[<span class="hljs-string">'imdb'</span>] * <span class="hljs-number">10</span>
<span class="hljs-comment"># Drop the 'ovie' bug: one small issue with the scrape on these two movies, so just drop those rows (copy to avoid a SettingWithCopyWarning)</span>
final_df = sci_fi_df.loc[sci_fi_df[<span class="hljs-string">'year'</span>] != <span class="hljs-string">'ovie'</span>].copy()
<span class="hljs-comment"># Make year numeric</span>
final_df[<span class="hljs-string">'year'</span>] = pd.to_numeric(final_df[<span class="hljs-string">'year'</span>])
</code></pre>
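<p>A value like <code>'ovie'</code> is what a fixed slice returns when the raw text does not end with a four-digit year in parentheses. As an alternative, here is a minimal sketch that looks for the four digits themselves with a regular expression. The sample strings are made up for illustration, and rows without a year simply stay missing.</p>
<pre><code class="lang-python">import pandas as pd

# Made-up examples of raw year strings, as they might come off the page.
raw_years = pd.Series(["(2019)", "(I) (2014)", "(2019 TV Movie)", None])

# Grab the first run of four digits; rows without one become missing values.
year_text = raw_years.str.extract(r"(\d{4})", expand=False)

# Convert to a nullable integer column so missing years stay missing.
years_clean = pd.to_numeric(year_text).astype("Int64")
print(years_clean.tolist())
</code></pre>
<p>This approach does not need a separate step to drop the problem rows, since a string with extra text after the year still yields the year.</p>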
<h2 id="heading-exploratory-data-analysis">Exploratory Data Analysis</h2>
<p>Now that you have the data, one of the first things you might want to do is learn more about it at a high level. The following commands are a useful first look at any data and we’ll use them next:</p>
<pre><code class="lang-python">final_df.head()
</code></pre>
<p>This command shows you the first 5 rows of your dataframe. It helps you see that nothing looks weird and everything is ready for analysis. You can see the output here:</p>
<p><img src="https://lh3.googleusercontent.com/TCYKlpEKIJOVJIAtIGN4wzDhCySaYIXI9cyBizZxR3XHsAQO_YH9mh626hCq8fdItaAF0N0cxSs1PP1eYujRsOt8HgeXtcC3hff-y0Jl4tvN__itH97iXqb6DrN6wJrngdsNaKQTQag5StHfOIcy5A0" alt="Image" width="600" height="400" loading="lazy">
<em>The first five rows of data outputted using the <code>final_df.head()</code> command</em></p>
<pre><code class="lang-python">final_df.describe()
</code></pre>
<p>This command will provide you with the mean, standard deviation, and other summaries. Count can show you if there are any null values in some of the columns which is useful information to know. The year column, for example, shows you the range of movies scraped – from 1927 to 2022. </p>
<p>You can see the output below and inspect the others:</p>
<p><img src="https://lh6.googleusercontent.com/Zeo_Y8ipyIejyYIBa2Aaocz4obHNlMVU76YTylZGl_wpRovYVFNS4e0m1DYAwkcqhpoYikJFL_dSgZSH-qoghJM3VMXESMUykrfs1e3JuXRkrp9iEZhPPnqGvsSamdYQe6Noz0Q0OA-Wen616-pmbDQ" alt="Image" width="600" height="400" loading="lazy">
<em>Running <code>final_df.describe()</code> produces summary statistics showing the number of data points, averages, standard deviations, and more.</em></p>
<pre><code class="lang-python">final_df.info()
</code></pre>
<p>This command lets you know the data types you are working with in each of your columns. </p>
<p>As a data scientist, this information can be helpful to you. Certain functions and methods need certain data types. You can also ensure that your underlying data types are in a format that makes sense for what they are. </p>
<p>For example: a 5 star rating should be a float or int (if decimals are not allowed). It should not be a string since it's a number. Here’s a summary of what the data format was for each variable after the scraping:</p>
<p><img src="https://lh3.googleusercontent.com/PT7Fa9XFYErtorVw6bNxw7Q1mI-p2_hlKgTbTs90RRpALPDlqd95F_EOwCQ7cV2cDymqZ-mXIa_0blqxxJ5wZ8Bznzd0iFyTB6kFroIUK2DJNzfRZgwgsRHr0pjDyE1ZUrQILf-22w856OoufnnKmRI" alt="Image" width="600" height="400" loading="lazy">
<em>Running <code>final_df.info()</code> results in showing you how many values you have in each column and what their data types are.</em></p>
<p>The next command to learn more about your variables produces a heatmap. The heatmap shows the correlation between all your quantitative variables. This is a quick way to assess relationships that may exist between variables. I like to see the coefficients rather than trying to decipher the color code, so I use the <code>annot=True</code> argument.</p>
<pre><code class="lang-python">sns.heatmap(final_df.corr(numeric_only=<span class="hljs-literal">True</span>), annot=<span class="hljs-literal">True</span>);
</code></pre>
<p>The command above produces the following visualization using the Seaborn data visualization package:</p>
<p><img src="https://lh3.googleusercontent.com/niHLKP7bps1EpZ_39u5k3dPDF0Xuz8Zuhal8Bbc8wtImKUv50M_7fEH65rCAkrTglGtZTJpZ2sRfIE0E6Kjn9m_CYGkRct83_3wWzVp0rnHA8nh5UuveFO0OqtjVfoOzMsKGq0lZ2uxw66Lp4g69aMo" alt="Image" width="600" height="400" loading="lazy">
<em>A heatmap of correlations produced by the <code>sns.heatmap()</code> call above</em></p>
<p>You can see that the strongest correlation is between the IMDB score and the metascore. This is not surprising since it's likely that two movie rating systems rate similarly.</p>
<p>The next strongest correlation you can see is between the IMDB rating and the number of votes. This is interesting because as the number of votes increases, you have a more representative sample of the population rating. It's strange to see that there is a weak association between the two, though.</p>
<p>The number of votes roughly increases as the runtime increases as well.</p>
<p>You can also see a slight negative association between IMDB or metascore and the year the movie came out. We’ll look at this shortly.</p>
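<p>If you prefer a ranked list to reading the heatmap, you can sort the pairs of columns by the strength of their correlation. The sketch below does this on a toy dataframe with made-up numbers, so the values it prints say nothing about the real IMDB data.</p>
<pre><code class="lang-python">from itertools import combinations

import pandas as pd

# Toy data with made-up numbers; the real dataframe is final_df.
toy = pd.DataFrame({
    "imdb": [6.1, 7.4, 8.0, 5.2, 6.8],
    "metascore": [55, 70, 82, 40, 66],
    "votes": [1200, 90000, 250000, 800, 30000],
    "rating": ["PG-13", "R", "PG-13", "R", "PG"],  # text column, skipped below
})

# numeric_only=True leaves out text columns such as rating.
corr = toy.corr(numeric_only=True)

# Look up the coefficient for each pair of different columns once.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Sort by absolute value so strong negative correlations rank high too.
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)
for (a, b), value in ranked:
    print(a, b, round(value, 2))
</code></pre>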
<p>You can check out some of these relationships visually via a scatter plot with this code:</p>
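<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt <span class="hljs-comment"># needed for the plt calls below</span>

x = final_df[<span class="hljs-string">'n_imdb'</span>]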
y = final_df[<span class="hljs-string">'votes'</span>]
plt.scatter(x, y, alpha=<span class="hljs-number">0.5</span>) <span class="hljs-comment"># s= is size var, c= is color var</span>
plt.xlabel(<span class="hljs-string">"IMDB Rating Standardized"</span>)
plt.ylabel(<span class="hljs-string">"Number of Votes"</span>)
plt.title(<span class="hljs-string">"Number of Votes vs. IMDB Rating"</span>)
plt.ticklabel_format(style=<span class="hljs-string">'plain'</span>)
plt.show()
</code></pre>
<p>This results in this visualization:</p>
<p><img src="https://lh3.googleusercontent.com/vvqxh5VwbHPoypyGlNBstgZW8puVWKa5m_hl6MYB_r78OfRC7TWBx9jxjf8PFflJO93hq83ZdIqX97uq6C_WjlZV5jorCDgtU3U3_dESuUgsStfLEgkeiikHTq2noabW_tPJQRRGpFrVmQ90gja4xAo" alt="Image" width="600" height="400" loading="lazy">
<em>IMDB Ratings vs. the Number of Votes</em></p>
<p>The association above shows some outliers. Generally, we see a greater number of votes on movies that have an IMDB rating of 85 or more. There are fewer reviews on movies with a rating of 75 or less. </p>
<p>Drawing these boxes around the data can show you what I mean. There are roughly two groupings of different magnitudes:</p>
<p><img src="https://lh6.googleusercontent.com/QEUbjZrtSiLCbdcXIR1MN0MKvcCZgVxeW2sPzMo4KL36pjCQq87rkRdgKKwK2yWSh2Uz0HMoIckyOa0qcNX4hCQok_kuuyqq4PddFHVuC5Tzyg9-WdZobdZgWfOpW1PnKWFKfQLaDAEDXoDHfiuU5mY" alt="Image" width="600" height="400" loading="lazy">
<em>Two Core Groups in the Data</em></p>
<p>Another thing that might be interesting to see is how many movies of each rating there are. This can show you where Sci-Fi tends to land in the ratings data. Use this code to get a bar chart of the ratings:</p>
<pre><code class="lang-python">ax = final_df[<span class="hljs-string">'rating'</span>].value_counts().plot(kind=<span class="hljs-string">'bar'</span>,
                                   figsize=(<span class="hljs-number">14</span>,<span class="hljs-number">8</span>),
                                   title=<span class="hljs-string">"Number of Movies by Rating"</span>)
ax.set_xlabel(<span class="hljs-string">"Rating"</span>)
ax.set_ylabel(<span class="hljs-string">"Number of Movies"</span>)
ax.plot();
</code></pre>
<p>That code results in this chart which shows us that R and PG-13 make up the majority of these Sci-Fi movies on IMDB.</p>
<p><img src="https://lh3.googleusercontent.com/7896rs2HtqgI4nIPyI-vUU5w43C3_Dcuyc_DdjUOudq76aIHstBINNVf5e0-1G3MUZzFgKDzK_2Jhsnno5swbXIoZwMuxqg1icY8aPbxWjOsCIm3BB9lObzY7HiDSAhmLTfcpfi2HWdW4VoUjBcnbrk" alt="Image" width="600" height="400" loading="lazy">
<em>Number of Movies by Rating</em></p>
<p>I did see that there were a few movies rated as “Approved” and was curious what that was. You can filter down the dataframe with this code to drill down into that:</p>
<pre><code class="lang-python">final_df[final_df[<span class="hljs-string">'rating'</span>] == <span class="hljs-string">'Approved'</span>]
</code></pre>
<p>This revealed that most of these movies were made before the 80s:</p>
<p><img src="https://lh3.googleusercontent.com/OIxMNDTgcXcPo_Wy8N7miq44OOAai4o-A8upYa1pNbqWjDVzPduRxNcMgUPuG9-OFyNd1AFgwWeq_o4E3Kv9pXy8xVSH7p6ZZi9uoOy78dBFK0LjvDnN9k7WYDTiZYxwpVgCiqXokWLPuMo746jvWMo" alt="Image" width="600" height="400" loading="lazy">
<em>All Rating "Approved" Movies</em></p>
<p>I went to the MPAA website, and there was no mention of this rating on their ratings information page. It must have been phased out at some point.</p>
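<p>Rather than eyeballing the filtered table, you could also compute what share of the "Approved" movies came out before 1980. Here is a minimal sketch on toy rows with made-up values standing in for <code>final_df</code>:</p>
<pre><code class="lang-python">import pandas as pd

# Toy rows with made-up values standing in for final_df.
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D", "E", "F"],
    "year": [1956, 1968, 1982, 1999, 1951, 1984],
    "rating": ["Approved", "Approved", "PG", "R", "Approved", "Approved"],
})

approved = toy[toy["rating"] == "Approved"]

# Share of Approved titles released before 1980 (lt means "less than").
share_before_1980 = approved["year"].lt(1980).mean()
print(len(approved), share_before_1980)
</code></pre>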
<p>You could also check out if any years or decades outperformed others on reviews. I took the average metascore by year and plotted that with the following code to explore further:</p>
<pre><code class="lang-python"><span class="hljs-comment"># What are the average metascores by year?</span>
final_df.groupby(<span class="hljs-string">'year'</span>)[<span class="hljs-string">'metascore'</span>].mean().plot(kind=<span class="hljs-string">'bar'</span>, figsize=(<span class="hljs-number">16</span>,<span class="hljs-number">8</span>), title=<span class="hljs-string">"Avg. Metascore by Year"</span>, xlabel=<span class="hljs-string">"Year"</span>, ylabel=<span class="hljs-string">"Avg. Metascore"</span>)
plt.xticks(rotation=<span class="hljs-number">90</span>)
plt.plot();
</code></pre>
<p>This results in the following chart:</p>
<p><img src="https://lh5.googleusercontent.com/BTwJBQhq0zBTr5UH5J3n7CSR6k3Ft8l9GBQ_czZRlu_LO192AXd0G_ozwXzsb6pctS-8lHCvgVLx6VZ7hWH-trp8C4oFAPCGufh2gq-F2WV96u90xt05KqUGYCqSFpmXxPEsFKSZQceglNItwChRnfE" alt="Image" width="600" height="400" loading="lazy">
<em>Avg. Metascore by Movie Year</em></p>
<p>Now I’m not saying I know why, but there is a gradual, mild decline as you progress through history in the average metascore variable. It seems that ratings have leveled out around 55-60 in the last couple decades. This might be because we have more data on newer movies or newer movies tend to get reviewed more.</p>
<pre><code class="lang-python">final_df[<span class="hljs-string">'year'</span>].value_counts().plot(kind=<span class="hljs-string">'bar'</span>, figsize=[<span class="hljs-number">20</span>,<span class="hljs-number">9</span>])
</code></pre>
<p>Run the above code and you will see that 1927 is represented by only one movie. An average based on a single data point is not reliable, and here it is likely inflated. You will also see that more recent years are better represented, as I suspected:</p>
<p><img src="https://lh6.googleusercontent.com/d2C-t1DLqSjRY8DpeoudyBNHG4SevXCZXFK4xoaw3QHpj_j4qnEf479Tn7wNyBqOwKAzR5GVidaRZ79XB5Eo36msA8LRBxNaJu_9Xk1VKE5oeo2Pue1TLnbjMX3y48Gc5xfOBnlZ1x9rdTktI_N2Bpg" alt="Image" width="600" height="400" loading="lazy">
<em>Number of Movies by Year</em></p>
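<p>One way to keep thinly represented years from skewing the picture is to compute the average and the number of movies together, then keep only the years with enough movies behind them. The sketch below uses toy data with made-up scores, and the cutoff of two movies is an arbitrary choice for illustration.</p>
<pre><code class="lang-python">import pandas as pd

# Toy data with made-up scores; the real dataframe is final_df.
toy = pd.DataFrame({
    "year": [1927, 2019, 2019, 2019, 2020, 2020],
    "metascore": [98, 60, 52, 71, 48, 66],
})

# Compute the average and the number of movies behind it in one pass.
by_year = toy.groupby("year")["metascore"].agg(["mean", "count"])

# Keep only years backed by at least two movies (ge means "greater or equal").
reliable = by_year[by_year["count"].ge(2)]
print(reliable)
</code></pre>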
<h2 id="heading-data-science-project-ideas-to-take-this-further">Data Science Project Ideas to Take This Further</h2>
<p>You have textual, categorical, and numeric variables here. There are a few options you could try to explore more.</p>
<p>One thing you could do is use Natural Language Processing (NLP) to see if there are any naming conventions to movie ratings, or within the world of Sci-Fi (or if you chose to do a different genre, whatever genre you chose!). </p>
<p>You could change the web scraping code to pull in many more genres too. With that you could create a new inter-genre database to see if there are naming conventions by genre.</p>
<p>You could then try to predict the genre based on the name of the movie. You could also try to predict the IMDB rating based on the genre or year the movie came out. The latter idea would work better in the last few decades since most observations are there.</p>
<p>I hope this tutorial sparked curiosity in you about the world of data science and what's possible! </p>
<p>You’ll find in exploratory data analysis that there are always more questions to ask. Working with that constraint is about prioritizing based on business goal(s). It’s important to start with those objectives up front or you could be in the data weeds exploring forever.</p>
<p>If the field of data science is interesting to you and you want to expand your skillset and enter into it professionally, consider checking out Springboard’s <a target="_blank" href="https://www.springboard.com/courses/data-science-career-track/">Data Science Career Track</a>. In this course, Springboard guides you through all of the key concepts in depth with a 1:1 expert mentor paired with you to support you on your journey.</p>
<p>I've written other articles that frame data science projects in relation to business problems and walk through technical approaches to solving them on my <a target="_blank" href="https://medium.com/@rileypredum">Medium</a>. Check those out if you are interested!</p>
<p>Happy coding!</p>
<p>Riley</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Web Scraping with Python – How to Scrape Data from Twitter using Tweepy and Snscrape ]]>
                </title>
                <description>
                    <![CDATA[ If you are a data enthusiast, you'll likely agree that one of the richest sources of real-world data is social media. Sites like Twitter are full of data. You can use the data you can get from social media in a number of ways, like sentiment analysis... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/python-web-scraping-tutorial/</link>
                <guid isPermaLink="false">66d45f3e4a7504b7409c3411</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Tue, 12 Jul 2022 17:58:29 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/07/scraping-with-python-article-image.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you are a data enthusiast, you'll likely agree that one of the richest sources of real-world data is social media. Sites like Twitter are full of data.</p>
<p>You can use the data you can get from social media in a number of ways, like sentiment analysis (analyzing people's thoughts) on a specific issue or field of interest.</p>
<p>There are several ways you can scrape (or gather) data from Twitter. And in this article, we will look at two of those ways: using Tweepy and Snscrape.</p>
<p>We will learn a method to scrape public conversations from people on a specific trending topic, as well as tweets from a particular user.</p>
<p>Now without further ado, let’s get started.</p>
<h2 id="heading-tweepy-vs-snscrape-introduction-to-our-scraping-tools">Tweepy vs Snscrape – Introduction to Our Scraping Tools</h2>
<p>Before we get into the implementation, let's look at how the two tools differ and what limits each one has.</p>
<h3 id="heading-tweepy">Tweepy</h3>
<p>Tweepy is a Python library for integrating with the Twitter API. Because Tweepy is connected with the Twitter API, you can perform complex queries in addition to scraping tweets. It enables you to take advantage of all of the Twitter API's capabilities.</p>
<p>But there are some drawbacks. For example, the standard search API only returns tweets from roughly the past week, so you can't use Tweepy with it to retrieve historical data.</p>
<p>Also, there are limits to how many tweets you can retrieve from a user's account. You can <a target="_blank" href="https://www.tweepy.org/">read more about Tweepy's functionalities here</a>.</p>
<h3 id="heading-snscrape">Snscrape</h3>
<p>Snscrape is another approach for scraping information from Twitter that does not require the use of an API. Snscrape allows you to scrape basic information such as a user's profile, tweet content, source, and so on.</p>
<p>Snscrape is not limited to Twitter, but can also scrape content from other prominent social media networks like Facebook, Instagram, and others.</p>
<p>Its advantages are that there are no limits to the number of tweets you can retrieve or the window of tweets (that is, the date range of tweets). So Snscrape allows you to retrieve old data.</p>
<p>But the one disadvantage is that it lacks all the other functionalities of Tweepy – still, if you only want to scrape tweets, Snscrape would be enough.</p>
<p>Now that we've clarified the distinction between the two methods, let's go over their implementation one by one.</p>
<h2 id="heading-how-to-use-tweepy-to-scrape-tweets">How to Use Tweepy to Scrape Tweets</h2>
<p>Before we begin using Tweepy, we must first make sure that our Twitter credentials are ready. With that, we can connect Tweepy to our API key and begin scraping.</p>
<p>If you do not have Twitter credentials, you can register for a Twitter developer account by going <a target="_blank" href="https://developer.twitter.com/">here</a>. You will be asked some basic questions about how you intend to use the Twitter API. After that, you can begin the implementation.</p>
<p>The first step is to install the Tweepy library on your local machine, which you can do by typing:</p>
<pre><code class="lang-bash">pip install git+https://github.com/tweepy/tweepy.git
</code></pre>
<h3 id="heading-how-to-scrape-tweets-from-a-user-on-twitter">How to Scrape Tweets from a User on Twitter</h3>
<p>Now that we’ve installed the Tweepy library, let’s scrape 100 tweets from a user called <code>john</code> on Twitter. We'll look at the full code implementation that will let us do this and discuss it in detail so we can grasp what’s going on:</p>
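<pre><code class="lang-python"><span class="hljs-keyword">import</span> time

<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> tweepy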

consumer_key = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer key </span>
consumer_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer Secret Key</span>
access_token = <span class="hljs-string">"XXXX"</span>    <span class="hljs-comment">#Your Access token key</span>
access_token_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your Access token Secret key</span>

<span class="hljs-comment">#Pass in our twitter API authentication key</span>
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

<span class="hljs-comment">#Instantiate the tweepy API</span>
api = tweepy.API(auth, wait_on_rate_limit=<span class="hljs-literal">True</span>)


username = <span class="hljs-string">"john"</span>
no_of_tweets =<span class="hljs-number">100</span>


<span class="hljs-keyword">try</span>:
    <span class="hljs-comment">#The number of tweets we want to retrieve from the user</span>
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)

    <span class="hljs-comment">#Pulling Some attributes from the tweet</span>
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] <span class="hljs-keyword">for</span> tweet <span class="hljs-keyword">in</span> tweets]

    <span class="hljs-comment">#Creation of column list to rename the columns in the dataframe</span>
    columns = [<span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweet"</span>]

    <span class="hljs-comment">#Creation of Dataframe</span>
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">'Status Failed On,'</span>,str(e))
    time.sleep(<span class="hljs-number">3</span>)
</code></pre>
<p>Now let's go over each part of the code in the above block.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tweepy

consumer_key = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer key </span>
consumer_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer Secret Key</span>
access_token = <span class="hljs-string">"XXXX"</span>    <span class="hljs-comment">#Your Access token key</span>
access_token_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your Access token Secret key</span>

<span class="hljs-comment">#Pass in our twitter API authentication key</span>
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

<span class="hljs-comment">#Instantiate the tweepy API</span>
api = tweepy.API(auth, wait_on_rate_limit=<span class="hljs-literal">True</span>)
</code></pre>
<p>In the above code, we imported the Tweepy library and created four variables to store our Twitter credentials (the Tweepy authentication handler requires all four). We then passed those variables into the authentication handler and saved the result in the <code>auth</code> variable.</p>
<p>The last statement instantiates the Tweepy API with the required parameters. Setting <code>wait_on_rate_limit=True</code> makes Tweepy wait for the rate limit to reset instead of raising an error.</p>
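<p>Hard-coding keys in a script makes them easy to leak. As a minimal sketch, you could read the four credentials from environment variables instead. Note that the variable names and the <code>load_credentials()</code> helper below are my own assumptions for illustration, not something Tweepy defines:</p>
<pre><code class="lang-python">import os

# Hypothetical variable names; use whatever names you export in your shell.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials in the order the auth handler receives them."""
    # Collect every variable that is unset or empty so the error names all of them
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in CREDENTIAL_VARS)
</code></pre>
<p>With that in place, the handler call could become <code>tweepy.OAuth1UserHandler(*load_credentials())</code>.</p>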
<pre><code class="lang-python">username = <span class="hljs-string">"john"</span>
no_of_tweets =<span class="hljs-number">100</span>


<span class="hljs-keyword">try</span>:
    <span class="hljs-comment">#The number of tweets we want to retrieve from the user</span>
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)

    <span class="hljs-comment">#Pulling Some attributes from the tweet</span>
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] <span class="hljs-keyword">for</span> tweet <span class="hljs-keyword">in</span> tweets]

    <span class="hljs-comment">#Creation of column list to rename the columns in the dataframe</span>
    columns = [<span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweet"</span>]

    <span class="hljs-comment">#Creation of Dataframe</span>
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">'Status Failed On,'</span>,str(e))
</code></pre>
<p>In the above code, we set the username (the @name on Twitter) whose tweets we want to retrieve, along with the number of tweets. We then wrapped the request in a <code>try</code>/<code>except</code> block so that any error is caught and printed.</p>
<p>After that, <code>api.user_timeline()</code> returns the most recent tweets posted by the user we passed in the <code>screen_name</code> parameter, up to the number given in <code>count</code>.</p>
<p>In the next line of code, we passed in some attributes we want to retrieve from each tweet and saved them into a list. To see more attributes you can retrieve from a tweet, read <a target="_blank" href="https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline">this</a>.</p>
<p>In the last chunk of code, we created a dataframe from that list, passing in the column names we defined.</p>
<p>Note that the column names must be in the same order as the attributes in the attributes container (that is, the order in which you listed the attributes when retrieving them from each tweet).</p>
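<p>To see why the order matters, here is a small sketch with a made-up row (no API call involved). It pairs each column name with the value in the same position, which is also how the dataframe lines them up:</p>
<pre><code class="lang-python"># A made-up row in the same order as the attributes list:
# created_at, favorite_count, source, text
row = ["2022-07-12 17:58:29", 12, "Twitter Web App", "Hello, world"]
columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

# zip() pairs names and values purely by position
record = dict(zip(columns, row))
print(record["Number of Likes"])

# Swapping two column names silently mislabels the data
wrong_columns = ["Number of Likes", "Date Created", "Source of Tweet", "Tweet"]
wrong_record = dict(zip(wrong_columns, row))
print(wrong_record["Number of Likes"])
</code></pre>
<p>The first lookup gives the like count, while the second gives the date under the "Number of Likes" label, and nothing warns you about the mix-up.</p>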
<p>If you correctly followed the steps I described, you should have something like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-17.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author</em></p>
<p>Now that we are done, let's go over one more example before we move into the Snscrape implementation.</p>
<h3 id="heading-how-to-scrape-tweets-from-a-text-search">How to Scrape Tweets from a Text Search</h3>
<p>In this method, we will retrieve tweets based on a search query. You can do that like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> tweepy

consumer_key = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer key </span>
consumer_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer Secret Key</span>
access_token = <span class="hljs-string">"XXXX"</span>    <span class="hljs-comment">#Your Access token key</span>
access_token_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your Access token Secret key</span>

<span class="hljs-comment">#Pass in our twitter API authentication key</span>
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

<span class="hljs-comment">#Instantiate the tweepy API</span>
api = tweepy.API(auth, wait_on_rate_limit=<span class="hljs-literal">True</span>)


search_query = <span class="hljs-string">"sex for grades"</span>
no_of_tweets =<span class="hljs-number">150</span>


<span class="hljs-keyword">try</span>:
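    <span class="hljs-comment">#The search endpoint returns at most 100 tweets per request, so Cursor pages through the results until we have no_of_tweets</span>
    tweets = tweepy.Cursor(api.search_tweets, q=search_query, count=<span class="hljs-number">100</span>).items(no_of_tweets)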

    <span class="hljs-comment">#Pulling Some attributes from the tweet</span>
    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source,  tweet.text] <span class="hljs-keyword">for</span> tweet <span class="hljs-keyword">in</span> tweets]

    <span class="hljs-comment">#Creation of column list to rename the columns in the dataframe</span>
    columns = [<span class="hljs-string">"User"</span>, <span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweet"</span>]

    <span class="hljs-comment">#Creation of Dataframe</span>
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">'Status Failed On,'</span>,str(e))
</code></pre>
<p>The above code is similar to the previous code, except that we changed the API method from <code>api.user_timeline()</code> to <code>api.search_tweets()</code>. We've also added <code>tweet.user.name</code> to the attributes container list. Keep in mind that the standard search endpoint returns at most 100 tweets per request, so collecting more than that requires paging through the results (for example with <code>tweepy.Cursor</code>).</p>
<p>In the code above, <code>tweet.user.name</code> chains two attribute lookups. This is because <code>tweet.user</code> on its own returns the whole user object, so we also need to name the attribute we want from that object, which is <code>name</code>.</p>
<p>You can go <a target="_blank" href="https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user">here</a> to see a list of additional attributes that you can retrieve from a user object. Now you should see something like this once you run it:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-18.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author.</em></p>
<p>Alright, that just about wraps up the Tweepy implementation. Just remember that there is a limit to the number of tweets you can retrieve, and that the standard search API cannot return tweets more than 7 days old.</p>
<h2 id="heading-how-to-use-snscrape-to-scrape-tweets">How to Use Snscrape to Scrape Tweets</h2>
<p>As I mentioned previously, Snscrape does not require Twitter credentials (API key) to access it. There is also no limit to the number of tweets you can fetch.</p>
<p>For this example, though, we'll just retrieve the same tweets as in the previous example, but using Snscrape instead.</p>
<p>To use Snscrape, we must first install the library. You can do that by typing:</p>
<pre><code class="lang-bash">pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git
</code></pre>
<h3 id="heading-how-to-scrape-tweets-from-a-user-with-snscrape">How to Scrape Tweets from a User with Snscrape</h3>
<p>Snscrape includes two methods for getting tweets from Twitter: the command line interface (CLI) and a Python Wrapper. Just keep in mind that the Python Wrapper is currently undocumented – but we can still get by with trial and error.</p>
<p>In this example, we will use the Python Wrapper because it is more intuitive than the CLI method. But if you get stuck with some code, you can always turn to the GitHub community for assistance. The contributors will be happy to help you.</p>
<p>To retrieve tweets from a particular user, we can do the following:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> snscrape.modules.twitter <span class="hljs-keyword">as</span> sntwitter
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Created a list to append all tweet attributes(data)</span>
attributes_container = []

<span class="hljs-comment"># Using TwitterSearchScraper to scrape data and append tweets to list</span>
<span class="hljs-keyword">for</span> i,tweet <span class="hljs-keyword">in</span> enumerate(sntwitter.TwitterSearchScraper(<span class="hljs-string">'from:john'</span>).get_items()):
    <span class="hljs-keyword">if</span> i&gt;<span class="hljs-number">100</span>:
        <span class="hljs-keyword">break</span>
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

<span class="hljs-comment"># Creating a dataframe from the tweets list above </span>
tweets_df = pd.DataFrame(attributes_container, columns=[<span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweets"</span>])
</code></pre>
<p>Let's go over some of the code that you might not understand at first glance:</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> i,tweet <span class="hljs-keyword">in</span> enumerate(sntwitter.TwitterSearchScraper(<span class="hljs-string">'from:john'</span>).get_items()):
    <span class="hljs-keyword">if</span> i&gt;<span class="hljs-number">100</span>:
        <span class="hljs-keyword">break</span>
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])


<span class="hljs-comment"># Creating a dataframe from the tweets list above </span>
tweets_df = pd.DataFrame(attributes_container, columns=[<span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweets"</span>])
</code></pre>
<p>In the above code, <code>sntwitter.TwitterSearchScraper</code> takes a search query and, through <code>get_items()</code>, yields the matching tweets one at a time. Here the query <code>from:john</code> matches tweets posted by the user john.</p>
<p>As I mentioned earlier, Snscrape does not limit the number of tweets, so it will keep returning tweets from that user until there are none left. To cap this, we wrap the call in <code>enumerate()</code>, which adds a counter as we iterate, and break out of the loop once we have the most recent 100 tweets.</p>
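<p>An alternative to counting and breaking by hand is <code>itertools.islice</code> from the standard library. The sketch below uses a made-up <code>fake_tweets()</code> generator as a stand-in for <code>get_items()</code>, so it runs without touching Twitter:</p>
<pre><code class="lang-python">from itertools import count, islice


def fake_tweets():
    """Hypothetical stand-in for get_items(): an endless stream of items."""
    for n in count(1):
        yield "tweet " + str(n)


# islice stops pulling from the generator after 100 items
latest = list(islice(fake_tweets(), 100))
print(len(latest))
</code></pre>
<p>Assuming <code>get_items()</code> behaves like any other generator, the same pattern should apply by passing it to <code>islice()</code> in place of <code>fake_tweets()</code>.</p>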
<p>You can see that the attributes we get from each tweet look like the ones from Tweepy. Below is a list of the attributes available on a Snscrape tweet, curated by Martin Beck.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/Sns.Scrape.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Credit: Martin Beck</em></p>
<p>More attributes might be added, as the Snscrape library is still in development. Attribute names can also change: for instance, <code>source</code> in the above image has been replaced with <code>sourceLabel</code>. If you pass in only <code>source</code>, it will return an object.</p>
<p>If you run the above code, you should see something like this as well:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-19.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author</em></p>
<p>Now let's do the same for scraping by search.</p>
<h3 id="heading-how-to-scrape-tweets-from-a-text-search-with-snscrape">How to Scrape Tweets from a Text Search with Snscrape</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> snscrape.modules.twitter <span class="hljs-keyword">as</span> sntwitter
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Creating list to append tweet data to</span>
attributes_container = []

<span class="hljs-comment"># Using TwitterSearchScraper to scrape data and append tweets to list</span>
<span class="hljs-keyword">for</span> i,tweet <span class="hljs-keyword">in</span> enumerate(sntwitter.TwitterSearchScraper(<span class="hljs-string">'sex for grades since:2021-07-05 until:2022-07-06'</span>).get_items()):
    <span class="hljs-keyword">if</span> i&gt;<span class="hljs-number">150</span>:
        <span class="hljs-keyword">break</span>
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

<span class="hljs-comment"># Creating a dataframe to load the list</span>
tweets_df = pd.DataFrame(attributes_container, columns=[<span class="hljs-string">"User"</span>, <span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweet"</span>])
</code></pre>
<p>Again, you can access a lot of historical data using Snscrape (unlike Tweepy, whose standard API cannot go back more than 7 days, or 30 days with the premium API). So we can pass the start and end dates of the search, using <code>since:</code> and <code>until:</code>, in the query string given to <code>sntwitter.TwitterSearchScraper()</code>.</p>
<p>What we've done in the preceding code is basically what we discussed before. The only thing to bear in mind is that <code>until</code> works like the end value of Python's <code>range()</code> function (that is, the end date itself is excluded). So if you want to get tweets from today, you need to set <code>until</code> to tomorrow's date.</p>
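<p>Since the end date is excluded, it can help to compute it rather than type it by hand. Here is a minimal sketch. The <code>build_query()</code> helper is my own invention for illustration and is not part of Snscrape:</p>
<pre><code class="lang-python">from datetime import date, timedelta


def build_query(text, since, last_day):
    """Build a search string whose results include tweets posted on last_day."""
    # until is exclusive, so move it one day past the last day we want
    until = last_day + timedelta(days=1)
    return text + " since:" + since.isoformat() + " until:" + until.isoformat()


query = build_query("python", date(2022, 7, 1), date(2022, 7, 5))
print(query)
</code></pre>
<p>This prints <code>python since:2022-07-01 until:2022-07-06</code>, a string you could then pass to <code>sntwitter.TwitterSearchScraper()</code>.</p>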
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-21.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author.</em></p>
<p>Now you know how to scrape tweets with Snscrape, too!</p>
<h3 id="heading-when-to-use-each-approach">When to use each approach</h3>
<p>Now that we've seen how each method works, you might be wondering when to use which.</p>
<p>Well, there is no universal rule for when to use each method. Everything comes down to a matter of preference and your use case.</p>
<p>If you want to collect a large number of tweets or reach far back in time, you should use Snscrape. But if you want to use extra features that Snscrape cannot provide (like geolocation, for example), then you should definitely use Tweepy. It is directly integrated with the Twitter API and provides complete functionality.</p>
<p>Even so, Snscrape is the most commonly used method for basic scraping.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, we learned how to scrape data from Twitter with Python using Tweepy and Snscrape. But this was only a brief overview of how each approach works. You can learn more by exploring the web for additional information.</p>
<p>I've included some useful resources that you can use if you need additional information. Thank you for reading.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/JustAnotherArchivist/snscrape">https://github.com/JustAnotherArchivist/snscrape</a></div>
<p> </p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://docs.tweepy.org/en/stable/index.html">https://docs.tweepy.org/en/stable/index.html</a></div>
<p> </p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af">https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Python Project – How to Create a Horoscope API with Beautiful Soup and Flask ]]>
                </title>
                <description>
                    <![CDATA[ Have you ever read your horoscope in the newspaper or seen it on television? Well, I'm not sure about other countries, but in my country of India, people still read their horoscopes.  And this is where I got the idea for this tutorial. It might sound... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/python-project-build-an-api-with-beautiful-soup-and-flask/</link>
                <guid isPermaLink="false">66ba0eb47282cc17abcf0c53</guid>
                
                    <category>
                        <![CDATA[ api ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Flask Framework ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ashutosh Krishna ]]>
                </dc:creator>
                <pubDate>Fri, 17 Dec 2021 18:29:53 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/12/horoscope-api-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Have you ever read your horoscope in the newspaper or seen it on television? Well, I'm not sure about other countries, but in my country of India, people still read their horoscopes. </p>
<p>And this is where I got the idea for this tutorial. It might sound a bit old-fashioned, but the main focus here is not on the horoscope itself – it's just the vehicle for our learning. </p>
<p>In this article, we're going to scrape a website called <a target="_blank" href="https://www.horoscope.com/us/index.aspx">Horoscope.com</a> using Beautiful Soup and then create our own API using Flask. This API, if deployed on a public server, can then be used by other developers who want to build a website or an app that shows horoscopes.</p>
<h2 id="heading-how-to-set-up-the-project">How to Set Up the Project</h2>
<p>First of all, we're going to create a virtual environment within which we'll install all the required dependencies. </p>
<p>Python ships with the <code>venv</code> module in its standard library. So, to create a virtual environment, you can use the below command:</p>
<pre><code class="lang-bash">$ python -m venv env
</code></pre>
<p>To activate the virtual environment named <code>env</code>, use the command:</p>
<ul>
<li>On Windows:</li>
</ul>
<pre><code class="lang-terminal">env\Scripts\activate.bat
</code></pre>
<ul>
<li>On Linux and MacOS:</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> env/bin/activate
</code></pre>
<p>To deactivate the environment (not required at this stage):</p>
<pre><code class="lang-bash">deactivate
</code></pre>
<p>Now we're ready to install the dependencies. The modules and libraries we are going to use in this project are:</p>
<ul>
    <li>requests:&nbsp;<a href="https://docs.python-requests.org/en/latest/">Requests</a>&nbsp;allows you to send HTTP/1.1 requests extremely easily. The module doesn't come pre-installed with Python, so we need to install it using the command:

    <pre><code class="language-bash">$ pip install requests</code></pre>
    </li>
    <li>bs4:&nbsp;<a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> (bs4)&nbsp;is a Python library for pulling data out of HTML and XML files.&nbsp;The module doesn't come pre-installed with Python, so we need to install it using the command:
    <pre><code class="language-bash">$ pip install bs4</code></pre>
    </li>
    <li>Flask:&nbsp;<a href="https://flask.palletsprojects.com/">Flask</a> is a simple, easy-to-use microframework for Python that can help build scalable and secure web applications. The module doesn't come pre-installed with Python, so we need to install it using the command:
    <pre><code class="language-bash">$ pip install flask</code></pre>
    </li>
    <li>Flask-RESTX:&nbsp;<a href="https://flask-restx.readthedocs.io/en/latest/quickstart.html">Flask-RESTX</a> lets you create APIs with Swagger Documentation. The module doesn't come pre-installed with Python, so we need to install it using the command:
    <pre><code class="language-bash">$ pip install flask-restx</code></pre>
    </li>
</ul>

<p>We'll also use environment variables in this project. So, we are going to install another module called <strong>python-decouple</strong> to handle this:</p>
<pre><code class="lang-bash">pip install python-decouple
</code></pre>
<p>To learn more about environment variables in Python, you can check out <a target="_blank" href="https://iread.ga/posts/49/do-you-really-need-environment-variables-in-python">this article</a>.</p>
<h2 id="heading-project-workflow">Project Workflow</h2>
<p>The basic workflow of the project will be like this:</p>
<ol>
<li>The horoscope data will be scraped from <a target="_blank" href="https://www.horoscope.com/us/index.aspx">Horoscope.com</a>.</li>
<li>The data will then be used by our Flask server to send a JSON response to the user.</li>
</ol>
<h2 id="heading-how-to-set-up-a-flask-project">How to Set Up a Flask Project</h2>
<p>The first thing we're going to do is to create a Flask project. If you check the <a target="_blank" href="https://flask.palletsprojects.com/en/2.0.x/quickstart/">official documentation</a> of Flask, you'll find a <a target="_blank" href="https://flask.palletsprojects.com/en/2.0.x/quickstart/#a-minimal-application">minimal application</a> there. </p>
<p>But we're not going to follow that. We are going to write an application that is more extensible and has a good base structure. If you wish, you can follow <a target="_blank" href="https://iread.ga/posts/54/getting-started-with-flask">this guide</a> to get started with Flask.</p>
<p>Our application will exist within a package called <strong>core</strong>. To turn a regular directory into a Python package, we just need to include an <code>__init__.py</code> file. So, let's create our core package first.</p>
<pre><code class="lang-bash">$ mkdir core
</code></pre>
<p>After that, let's create the <code>__init__.py</code> file inside the core directory:</p>
<pre><code class="lang-bash">$ <span class="hljs-built_in">cd</span> core
$ touch __init__.py
$ <span class="hljs-built_in">cd</span> ..
</code></pre>
<p>In the root directory of the project, create a file called <code>config.py</code>. We'll store the configurations for the project in this file. Within the file, add the following content:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> decouple <span class="hljs-keyword">import</span> config


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Config</span>(<span class="hljs-params">object</span>):</span>
    SECRET_KEY = config(<span class="hljs-string">'SECRET_KEY'</span>, default=<span class="hljs-string">'guess-me'</span>)
    DEBUG = <span class="hljs-literal">False</span>
    TESTING = <span class="hljs-literal">False</span>
    CSRF_ENABLED = <span class="hljs-literal">True</span>


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ProductionConfig</span>(<span class="hljs-params">Config</span>):</span>
    DEBUG = <span class="hljs-literal">False</span>
    MAIL_DEBUG = <span class="hljs-literal">False</span>


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">StagingConfig</span>(<span class="hljs-params">Config</span>):</span>
    DEVELOPMENT = <span class="hljs-literal">True</span>
    DEBUG = <span class="hljs-literal">True</span>


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DevelopmentConfig</span>(<span class="hljs-params">Config</span>):</span>
    DEVELOPMENT = <span class="hljs-literal">True</span>
    DEBUG = <span class="hljs-literal">True</span>


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TestingConfig</span>(<span class="hljs-params">Config</span>):</span>
    TESTING = <span class="hljs-literal">True</span>
</code></pre>
<p>In the above script, we have created a <em>Config</em> class and defined various attributes inside it. We have also created child classes (one for each stage of development) that inherit from the <em>Config</em> class.</p>
<p>Notice that we have the SECRET_KEY set to an environment variable named <strong>SECRET_KEY</strong>. Create a file named <code>.env</code> in the root directory and add the following content there:</p>
<pre><code class="lang-env">APP_SETTINGS=config.DevelopmentConfig
SECRET_KEY=gufldksfjsdf
</code></pre>
<p>Apart from <strong>SECRET_KEY</strong>, we have <strong>APP_SETTINGS</strong> that refers to one of the classes we created in the <code>config.py</code> file. We set it to the current stage of the project.</p>
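<p>It helps to know where a value like <strong>SECRET_KEY</strong> comes from when it is defined in more than one place. As I understand python-decouple, it checks real environment variables first, then the <code>.env</code> file, and finally the <code>default</code> argument. The sketch below imitates that order with the standard library only. The <code>lookup()</code> function is a simplified stand-in I made up, not the library's actual code:</p>
<pre><code class="lang-python">import os


def lookup(name, dotenv_values, default=None):
    """Simplified stand-in for config(): environment first, then .env, then default."""
    if name in os.environ:
        return os.environ[name]
    if name in dotenv_values:
        return dotenv_values[name]
    if default is not None:
        return default
    raise KeyError(name + " is not set and has no default")


# Values as they would be parsed from the .env file shown above
dotenv_values = {"APP_SETTINGS": "config.DevelopmentConfig", "SECRET_KEY": "gufldksfjsdf"}
print(lookup("APP_SETTINGS", dotenv_values))
</code></pre>
<p>So the <code>'guess-me'</code> default in <code>config.py</code> is only used when <strong>SECRET_KEY</strong> is set neither in the environment nor in the <code>.env</code> file.</p>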
<p>Now, we can add the following content in the <code>__init__.py</code> file:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask
<span class="hljs-keyword">from</span> decouple <span class="hljs-keyword">import</span> config
<span class="hljs-keyword">from</span> flask_restx <span class="hljs-keyword">import</span> Api

app = Flask(__name__)
app.config.from_object(config(<span class="hljs-string">"APP_SETTINGS"</span>))
api = Api(
    app,
    version=<span class="hljs-string">'1.0'</span>,
    title=<span class="hljs-string">'Horoscope API'</span>,
    description=<span class="hljs-string">'Get horoscope data easily using the below APIs'</span>,
    license=<span class="hljs-string">'MIT'</span>,
    contact=<span class="hljs-string">'Ashutosh Krishna'</span>,
    contact_url=<span class="hljs-string">'https://ashutoshkrris.tk'</span>,
    contact_email=<span class="hljs-string">'contact@ashutoshkrris.tk'</span>,
    doc=<span class="hljs-string">'/'</span>,
    prefix=<span class="hljs-string">'/api/v1'</span>
)
</code></pre>
<p>In the above Python script, we first import the Flask class from the <code>flask</code> package that we installed. Next, we create an object <code>app</code> of class Flask. We pass <code>__name__</code> to indicate the app's module or package, so that Flask knows where to find other files such as templates. </p>
<p>Next, we load the app configuration from the class named in the <strong>APP_SETTINGS</strong> variable of the <code>.env</code> file. </p>
<p>Apart from that, we have created an object of the <em>Api</em> class. We need to pass various arguments to it. We can find the Swagger documentation on the <code>/</code> route. The <code>/api/v1</code> will be prefixed on each API route. </p>
<p>For now, let's create a <code>routes.py</code> file in the <code>core</code> package and just add the following namespace:</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> core <span class="hljs-keyword">import</span> api
<span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> jsonify

ns = api.namespace(<span class="hljs-string">'/'</span>, description=<span class="hljs-string">'Horoscope APIs'</span>)
</code></pre>
<p>We need to import the routes in the <code>__init__.py</code> file:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask
<span class="hljs-keyword">from</span> decouple <span class="hljs-keyword">import</span> config
<span class="hljs-keyword">from</span> flask_restx <span class="hljs-keyword">import</span> Api

app = Flask(__name__)
app.config.from_object(config(<span class="hljs-string">"APP_SETTINGS"</span>))
api = Api(
    app,
    version=<span class="hljs-string">'1.0'</span>,
    title=<span class="hljs-string">'Horoscope API'</span>,
    description=<span class="hljs-string">'Get horoscope data easily using the below APIs'</span>,
    license=<span class="hljs-string">'MIT'</span>,
    contact=<span class="hljs-string">'Ashutosh Krishna'</span>,
    contact_url=<span class="hljs-string">'https://ashutoshkrris.tk'</span>,
    contact_email=<span class="hljs-string">'contact@ashutoshkrris.tk'</span>,
    doc=<span class="hljs-string">'/'</span>,
    prefix=<span class="hljs-string">'/api/v1'</span>
)

<span class="hljs-keyword">from</span> core <span class="hljs-keyword">import</span> routes            <span class="hljs-comment"># Add this line</span>
</code></pre>
<p>We're now left with just one file, <code>main.py</code>, which will help us run the Flask server:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> core <span class="hljs-keyword">import</span> app

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    app.run()
</code></pre>
<p>Once you run this file using the <code>python main.py</code> command, you'll see a similar output:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/12/Screenshot-2021-12-16-160820.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now, we are ready to scrape the data from the Horoscope website.</p>
<h2 id="heading-how-to-scrape-the-data-from-horoscopecom">How to Scrape the Data from Horoscope.com</h2>
<p>If you open <a target="_blank" href="https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx">Horoscope.com</a> and choose your zodiac sign, the horoscope data for your zodiac sign for today will be shown.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/12/Screenshot-2021-12-16-073450.png" alt="Image" width="600" height="400" loading="lazy">
<em>Source: Horoscope.com</em></p>
<p>As the image above shows, you can view the horoscope for yesterday, tomorrow, the week, the month, or even a custom date. We're going to use all of these.</p>
<p>But first, look at the URL of the current page. It is something like: <a target="_blank" href="https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=10">https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=10</a>.</p>
<p>The URL has two variable parts: <strong>sign</strong> and <strong>today</strong>. The value of <strong>sign</strong> is a number assigned to each zodiac sign. The <strong>today</strong> part can be replaced with <strong>yesterday</strong> or <strong>tomorrow</strong>.</p>
<p>The dictionary below can help us with the zodiac signs:</p>
<pre><code class="lang-python">ZODIAC_SIGNS = {
    <span class="hljs-string">"Aries"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"Taurus"</span>: <span class="hljs-number">2</span>,
    <span class="hljs-string">"Gemini"</span>: <span class="hljs-number">3</span>,
    <span class="hljs-string">"Cancer"</span>: <span class="hljs-number">4</span>,
    <span class="hljs-string">"Leo"</span>: <span class="hljs-number">5</span>,
    <span class="hljs-string">"Virgo"</span>: <span class="hljs-number">6</span>,
    <span class="hljs-string">"Libra"</span>: <span class="hljs-number">7</span>,
    <span class="hljs-string">"Scorpio"</span>: <span class="hljs-number">8</span>,
    <span class="hljs-string">"Sagittarius"</span>: <span class="hljs-number">9</span>,
    <span class="hljs-string">"Capricorn"</span>: <span class="hljs-number">10</span>,
    <span class="hljs-string">"Aquarius"</span>: <span class="hljs-number">11</span>,
    <span class="hljs-string">"Pisces"</span>: <span class="hljs-number">12</span>
}
</code></pre>
<p>This means that if your zodiac sign is <strong>Capricorn</strong>, the value of <strong>sign</strong> in the URL will be <strong>10</strong>. </p>
<p>Next, if we wish to get the horoscope data for a custom date, the URL <a target="_blank" href="https://www.horoscope.com/us/horoscopes/general/horoscope-archive.aspx?sign=10&amp;laDate=20211213">https://www.horoscope.com/us/horoscopes/general/horoscope-archive.aspx?sign=10&amp;laDate=20211213</a> will help us. </p>
<p>It has the same <strong>sign</strong> variable, but it has another variable <strong>laDate</strong> which takes the date in <strong>YYYYMMDD</strong> format. </p>
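<p>The two URL patterns described above can be sketched as small helper functions. The names <code>BASE</code>, <code>daily_url()</code>, and <code>archive_url()</code> are made up for illustration and are not part of the tutorial's code.</p>

```python
from urllib.parse import urlencode

# Common prefix of both URL patterns (hypothetical constant name).
BASE = "https://www.horoscope.com/us/horoscopes/general"


def daily_url(sign, day="today"):
    """Build the URL for today, yesterday, or tomorrow."""
    query = urlencode({"sign": sign})
    return f"{BASE}/horoscope-general-daily-{day}.aspx?{query}"


def archive_url(sign, date):
    """Build the URL for a custom date given as YYYY-MM-DD."""
    # The site expects the date as YYYYMMDD in the laDate parameter.
    query = urlencode({"sign": sign, "laDate": date.replace("-", "")})
    return f"{BASE}/horoscope-archive.aspx?{query}"


print(daily_url(10))
print(archive_url(10, "2021-12-13"))
```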
<p>Now, we're ready to create different functions to fetch horoscope data. Create a <code>utils.py</code> file and follow along.</p>
<h3 id="heading-howe-to-get-a-horoscope-for-the-day">How to Get a Horoscope for the Day</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_horoscope_by_day</span>(<span class="hljs-params">zodiac_sign: int, day: str</span>):</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"-"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> day:
        res = requests.get(<span class="hljs-string">f"https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-<span class="hljs-subst">{day}</span>.aspx?sign=<span class="hljs-subst">{zodiac_sign}</span>"</span>)
    <span class="hljs-keyword">else</span>:
        day = day.replace(<span class="hljs-string">"-"</span>, <span class="hljs-string">""</span>)
        res = requests.get(<span class="hljs-string">f"https://www.horoscope.com/us/horoscopes/general/horoscope-archive.aspx?sign=<span class="hljs-subst">{zodiac_sign}</span>&amp;laDate=<span class="hljs-subst">{day}</span>"</span>)
    soup = BeautifulSoup(res.content, <span class="hljs-string">'html.parser'</span>)
    data = soup.find(<span class="hljs-string">'div'</span>, attrs={<span class="hljs-string">'class'</span>: <span class="hljs-string">'main-horoscope'</span>})
    <span class="hljs-keyword">return</span> data.p.text
</code></pre>
<p>We have created our first function which accepts two arguments – an integer <strong>zodiac_sign</strong> and a string <strong>day</strong>. The day can either be today, tomorrow, yesterday or any custom date before today in the format YYYY-MM-DD. </p>
<p>If the day is not a custom date, it won't contain a hyphen (-), so we branch on that. </p>
<p>If there is no hyphen, we make a GET request to <code>https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-{day}.aspx?sign={zodiac_sign}</code>. Otherwise, we first change the date from YYYY-MM-DD to YYYYMMDD format. </p>
<p>Then we make a GET request to <code>https://www.horoscope.com/us/horoscopes/general/horoscope-archive.aspx?sign={zodiac_sign}&amp;laDate={day}</code>. </p>
<p>After that, we parse the HTML from the response content using BeautifulSoup. Now we need to get the horoscope text from this HTML. If you inspect the code of any of these webpages, you'll find this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/12/screenshot-2021-12-07-081318_nwhwwf.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The horoscope text is contained in a <strong>div</strong> with the class <strong>main-horoscope</strong>. Thus we use the <code>soup.find()</code> function to extract the paragraph text string, and return it.</p>
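<p>To make the lookup concrete without installing anything, here is a rough sketch that swaps BeautifulSoup for the standard library's <code>html.parser</code>. It is not the tutorial's method. The class and function names are hypothetical, and the sample HTML is invented to resemble the structure in the screenshot. Unlike <code>data.p.text</code>, it returns <code>None</code> when the div is missing instead of raising an error.</p>

```python
from html.parser import HTMLParser


class MainHoroscopeParser(HTMLParser):
    """Collect the text of the first <p> inside <div class="main-horoscope">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0         # greater than 0 while inside the target div
        self.in_paragraph = False  # True while inside the first <p> of that div
        self.done = False          # True once the first paragraph has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Enter the target div, or track a nested div inside it.
            if self.div_depth or "main-horoscope" in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            self.done = True
        elif tag == "div" and self.div_depth:
            self.div_depth -= 1

    def handle_data(self, data):
        if self.in_paragraph:
            self.chunks.append(data)


def extract_horoscope(html):
    """Return the horoscope text, or None if the expected div is absent."""
    parser = MainHoroscopeParser()
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    return text or None


# Invented sample markup, for illustration only.
sample = '<div class="main-horoscope"><p><strong>Dec 16, 2021</strong> - Stay calm.</p><p>More</p></div>'
print(extract_horoscope(sample))
```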
<h3 id="heading-how-to-get-a-horoscope-for-the-week">How to Get a Horoscope for the Week</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_horoscope_by_week</span>(<span class="hljs-params">zodiac_sign: int</span>):</span>
    res = requests.get(<span class="hljs-string">f"https://www.horoscope.com/us/horoscopes/general/horoscope-general-weekly.aspx?sign=<span class="hljs-subst">{zodiac_sign}</span>"</span>)
    soup = BeautifulSoup(res.content, <span class="hljs-string">'html.parser'</span>)
    data = soup.find(<span class="hljs-string">'div'</span>, attrs={<span class="hljs-string">'class'</span>: <span class="hljs-string">'main-horoscope'</span>})
    <span class="hljs-keyword">return</span> data.p.text
</code></pre>
<p>The above function is quite similar to the previous one. We have just changed the URL to <code>https://www.horoscope.com/us/horoscopes/general/horoscope-general-weekly.aspx?sign={zodiac_sign}</code>.</p>
<h3 id="heading-how-to-get-a-horoscope-for-the-month">How to Get a Horoscope for the Month</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_horoscope_by_month</span>(<span class="hljs-params">zodiac_sign: int</span>):</span>
    res = requests.get(<span class="hljs-string">f"https://www.horoscope.com/us/horoscopes/general/horoscope-general-monthly.aspx?sign=<span class="hljs-subst">{zodiac_sign}</span>"</span>)
    soup = BeautifulSoup(res.content, <span class="hljs-string">'html.parser'</span>)
    data = soup.find(<span class="hljs-string">'div'</span>, attrs={<span class="hljs-string">'class'</span>: <span class="hljs-string">'main-horoscope'</span>})
    <span class="hljs-keyword">return</span> data.p.text
</code></pre>
<p>This function is also similar to the other two, except for the URL, which has now been changed to <code>https://www.horoscope.com/us/horoscopes/general/horoscope-general-monthly.aspx?sign={zodiac_sign}</code>.</p>
<h2 id="heading-how-to-create-api-routes">How to Create API Routes</h2>
<p>We'll be using Flask-RESTX to create our API routes. The API routes will look like these:</p>
<ul>
<li>For daily or custom dates: <code>/api/v1/get-horoscope/daily?day=today&amp;sign=capricorn</code> or <code>/api/v1/get-horoscope/daily?day=2022-12-14&amp;sign=capricorn</code></li>
<li>For weekly: <code>/api/v1/get-horoscope/weekly?sign=capricorn</code></li>
<li>For monthly: <code>/api/v1/get-horoscope/monthly?sign=capricorn</code></li>
</ul>
<p>The URLs take two query parameters: <strong>day</strong> and <strong>sign</strong>. The <strong>day</strong> parameter can take values like today, yesterday, or custom dates like 2022-12-14. The <strong>sign</strong> parameter takes the zodiac sign name, in uppercase or lowercase.</p>
<p>Flask-RESTX parses and validates query parameters with <strong>reqparse</strong>, a built-in module similar to <a target="_blank" href="https://docs.python.org/3/library/argparse.html#module-argparse"><strong>argparse</strong></a>. To declare an argument, we'll use the <strong>add_argument</strong> method of the <em>RequestParser</em> class.</p>
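<pre><code class="lang-python"><span class="hljs-keyword">from</span> flask_restx <span class="hljs-keyword">import</span> Resource, reqparse

parser = reqparse.RequestParser()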
parser.add_argument(<span class="hljs-string">'sign'</span>, type=str, required=<span class="hljs-literal">True</span>)
</code></pre>
<p>The <code>type</code> argument sets the type that the parameter's value is converted to. Setting <code>required=True</code> makes the query parameter mandatory. </p>
<p>Now, we need another query parameter, <strong>day</strong>. But this parameter will be used only in the daily horoscope URL. </p>
<p>Instead of rewriting arguments we can write a parent parser containing all the shared arguments and then extend the parser with <a target="_blank" href="https://flask-restplus.readthedocs.io/en/stable/api.html#flask_restplus.reqparse.RequestParser.copy"><code>copy()</code></a>.</p>
<pre><code class="lang-python">parser_copy = parser.copy()
parser_copy.add_argument(<span class="hljs-string">'day'</span>, type=str, required=<span class="hljs-literal">True</span>)
</code></pre>
<p>The <code>parser_copy</code> will not only contain <strong>day</strong>, but also <strong>sign</strong>. That is what we'll require for the daily horoscope.</p>
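<p>The same parent-and-child idea exists in the standard library's <code>argparse</code>, which reqparse was compared to earlier. The sketch below is only an analogy using command-line flags, not Flask-RESTX code. The flag names mirror the query parameters, and the variable names are made up.</p>

```python
import argparse

# Parent parser: holds the argument that every endpoint shares.
parent = argparse.ArgumentParser(add_help=False)
parent.add_argument("--sign", type=str, required=True)

# Child parser: inherits --sign from the parent and adds --day on top.
daily = argparse.ArgumentParser(parents=[parent])
daily.add_argument("--day", type=str, required=True)

args = daily.parse_args(["--sign", "capricorn", "--day", "today"])
print(args.sign, args.day)
```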
<p>The main building blocks provided by Flask-RESTX are resources. Resources are built on top of <a target="_blank" href="https://flask.palletsprojects.com/en/2.0.x/views/">Flask pluggable views</a>, giving you easy access to multiple HTTP methods just by defining methods on your resource. </p>
<p>Let's create the <em>DailyHoroscopeAPI</em> class that inherits the <em>Resource</em> class from <code>flask_restx</code>.</p>
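<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> werkzeug.exceptions <span class="hljs-keyword">import</span> BadRequest, NotFound

<span class="hljs-meta">@ns.route('/get-horoscope/daily')</span>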
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DailyHoroscopeAPI</span>(<span class="hljs-params">Resource</span>):</span>
    <span class="hljs-string">'''Shows daily horoscope of zodiac signs'''</span>
<span class="hljs-meta">    @ns.doc(parser=parser_copy)</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get</span>(<span class="hljs-params">self</span>):</span>
        args = parser_copy.parse_args()
        day = args.get(<span class="hljs-string">'day'</span>)
        zodiac_sign = args.get(<span class="hljs-string">'sign'</span>)
        <span class="hljs-keyword">try</span>:
            zodiac_num = ZODIAC_SIGNS[zodiac_sign.capitalize()]
            <span class="hljs-keyword">if</span> <span class="hljs-string">"-"</span> <span class="hljs-keyword">in</span> day:
                datetime.strptime(day, <span class="hljs-string">'%Y-%m-%d'</span>)
            horoscope_data = get_horoscope_by_day(zodiac_num, day)
            <span class="hljs-keyword">return</span> jsonify(success=<span class="hljs-literal">True</span>, data=horoscope_data, status=<span class="hljs-number">200</span>)
        <span class="hljs-keyword">except</span> KeyError:
            <span class="hljs-keyword">raise</span> NotFound(<span class="hljs-string">'No such zodiac sign exists'</span>)
        <span class="hljs-keyword">except</span> AttributeError:
            <span class="hljs-keyword">raise</span> BadRequest(
                <span class="hljs-string">'Something went wrong, please check the URL and the arguments.'</span>)
        <span class="hljs-keyword">except</span> ValueError:
            <span class="hljs-keyword">raise</span> BadRequest(<span class="hljs-string">'Please enter day in correct format: YYYY-MM-DD'</span>)
</code></pre>
<p>The <code>@ns.route()</code> decorator sets the API route. Inside the <em>DailyHoroscopeAPI</em> class, we have the <strong>get</strong> method that will handle the GET requests. The <code>@ns.doc()</code> decorator documents the query parameters, so they show up in the Swagger UI. </p>
<p>To get the values of the query parameters, we'll use the <strong>parse_args()</strong> method, which returns a dictionary like this:</p>
<pre><code class="lang-bash">{<span class="hljs-string">'sign'</span>: <span class="hljs-string">'capricorn'</span>, <span class="hljs-string">'day'</span>: <span class="hljs-string">'2022-12-14'</span>}
</code></pre>
<p>We can then get the values using the keys <strong>day</strong> and <strong>sign</strong>.</p>
<p>As defined in the beginning, we'll have a ZODIAC_SIGNS dictionary. We use a <strong>try-except</strong> block to handle the request. If the zodiac sign is not in the dictionary, a <em>KeyError</em> Exception is raised. In that case, we respond with a <em>NotFound</em> error (Error 404). </p>
<p>Also, if the <strong>day</strong> parameter has a hyphen in it, we first check that it matches the YYYY-MM-DD format. If it doesn't, <code>strptime()</code> raises a <em>ValueError</em> and we respond with a <em>BadRequest</em> error (Error 400). Otherwise, we call the <code>get_horoscope_by_day()</code> function with the sign number and the <strong>day</strong> value. </p>
<p>If some gibberish is passed as the <strong>day</strong> value, the page we request most likely won't contain the <strong>main-horoscope</strong> div. Then <code>soup.find()</code> returns <code>None</code>, and accessing <code>.p</code> on it raises an <em>AttributeError</em>. In that case, we also raise a <em>BadRequest</em> error.</p>
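<p>The checks described above can also run before any request is sent. The following is a minimal sketch under that assumption. <code>RELATIVE_DAYS</code> and <code>parse_request()</code> are hypothetical names, and rejecting unknown day keywords up front is a design choice of this sketch rather than what the route code does.</p>

```python
from datetime import datetime

ZODIAC_SIGNS = {
    "Aries": 1, "Taurus": 2, "Gemini": 3, "Cancer": 4,
    "Leo": 5, "Virgo": 6, "Libra": 7, "Scorpio": 8,
    "Sagittarius": 9, "Capricorn": 10, "Aquarius": 11, "Pisces": 12,
}

# Day keywords that the daily URL pattern accepts (hypothetical constant).
RELATIVE_DAYS = {"today", "yesterday", "tomorrow"}


def parse_request(sign, day):
    """Return (zodiac_num, day); raise KeyError or ValueError on bad input."""
    # capitalize() makes the lookup case-insensitive: "CAPRICORN" -> "Capricorn".
    zodiac_num = ZODIAC_SIGNS[sign.strip().capitalize()]  # KeyError if unknown
    day = day.strip().lower()
    if day in RELATIVE_DAYS:
        return zodiac_num, day
    # Anything else must be a real calendar date in YYYY-MM-DD format.
    parsed = datetime.strptime(day, "%Y-%m-%d")  # ValueError if malformed
    return zodiac_num, parsed.strftime("%Y-%m-%d")


print(parse_request("CAPRICORN", "Today"))
print(parse_request("leo", "2021-12-13"))
```

<p>With this in place, a value such as <code>gibberish</code> fails with a <em>ValueError</em> before the scraper ever runs.</p>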
<p>The other two routes are also quite similar to the above one. The difference is, we don't need a day parameter here. So, instead of using <code>parser_copy</code>, we'll use <code>parser</code> here.</p>
<pre><code class="lang-python"><span class="hljs-meta">@ns.route('/get-horoscope/weekly')</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WeeklyHoroscopeAPI</span>(<span class="hljs-params">Resource</span>):</span>
    <span class="hljs-string">'''Shows weekly horoscope of zodiac signs'''</span>
<span class="hljs-meta">    @ns.doc(parser=parser)</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get</span>(<span class="hljs-params">self</span>):</span>
        args = parser.parse_args()
        zodiac_sign = args.get(<span class="hljs-string">'sign'</span>)
        <span class="hljs-keyword">try</span>:
            zodiac_num = ZODIAC_SIGNS[zodiac_sign.capitalize()]
            horoscope_data = get_horoscope_by_week(zodiac_num)
            <span class="hljs-keyword">return</span> jsonify(success=<span class="hljs-literal">True</span>, data=horoscope_data, status=<span class="hljs-number">200</span>)
        <span class="hljs-keyword">except</span> KeyError:
            <span class="hljs-keyword">raise</span> NotFound(<span class="hljs-string">'No such zodiac sign exists'</span>)
        <span class="hljs-keyword">except</span> AttributeError:
            <span class="hljs-keyword">raise</span> BadRequest(<span class="hljs-string">'Something went wrong, please check the URL and the arguments.'</span>)


<span class="hljs-meta">@ns.route('/get-horoscope/monthly')</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MonthlyHoroscopeAPI</span>(<span class="hljs-params">Resource</span>):</span>
    <span class="hljs-string">'''Shows monthly horoscope of zodiac signs'''</span>
<span class="hljs-meta">    @ns.doc(parser=parser)</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get</span>(<span class="hljs-params">self</span>):</span>
        args = parser.parse_args()
        zodiac_sign = args.get(<span class="hljs-string">'sign'</span>)
        <span class="hljs-keyword">try</span>:
            zodiac_num = ZODIAC_SIGNS[zodiac_sign.capitalize()]
            horoscope_data = get_horoscope_by_month(zodiac_num)
            <span class="hljs-keyword">return</span> jsonify(success=<span class="hljs-literal">True</span>, data=horoscope_data, status=<span class="hljs-number">200</span>)
        <span class="hljs-keyword">except</span> KeyError:
            <span class="hljs-keyword">raise</span> NotFound(<span class="hljs-string">'No such zodiac sign exists'</span>)
        <span class="hljs-keyword">except</span> AttributeError:
            <span class="hljs-keyword">raise</span> BadRequest(<span class="hljs-string">'Something went wrong, please check the URL and the arguments.'</span>)
</code></pre>
<p>Now our routes are done. To test the APIs, you can use the Swagger documentation available on the <code>/</code> route, or you can use <a target="_blank" href="https://www.postman.com/">Postman</a>. Let's run the server and test it.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/yggJPOqr6jc" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p>You can also deploy the project on a public server so that other developers can access and use the API too.</p>
<h2 id="heading-wrapping-up">Wrapping up</h2>
<p>In this tutorial, we learned how to scrape data from a website using requests and Beautiful Soup. Then we created an API using Flask and Flask-RESTX. </p>
<p>If you wish to learn how to interact with APIs using Python, check out <a target="_blank" href="https://www.freecodecamp.org/news/how-to-interact-with-web-services-using-python/">this guide</a>.</p>
<p>I hope you enjoyed it – and thanks for reading!</p>
<p>Code for the tutorial: <a target="_blank" href="https://github.com/ashutoshkrris/Horoscope-API">https://github.com/ashutoshkrris/Horoscope-API</a> </p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Python Web Scraping Tutorial – How to Scrape Data From Any Website with Python ]]>
                </title>
                <description>
                    <![CDATA[ By Sorin-Gabriel Marica Web scraping is the process of extracting specific data from the internet automatically. It has many use cases, like getting data for a machine learning project, creating a price comparison tool, or any other innovative idea t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-2/</link>
                <guid isPermaLink="false">66d4614a51f567b42d9f84d4</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 10 Aug 2021 17:42:52 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/08/how-to-scrape-data-from-any-website-with-python.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Sorin-Gabriel Marica</p>
<p>Web scraping is the process of extracting specific data from the internet automatically. It has many use cases, like getting data for a machine learning project, creating a price comparison tool, or any other innovative idea that requires an immense amount of data.</p>
<p>While you can theoretically do data extraction manually, the sheer volume of content on the internet makes this approach unrealistic in many cases. So knowing how to build a web scraper can come in handy. </p>
<p>This article’s purpose is to teach you how to create a web scraper in Python. You will learn how to inspect a website to prepare for scraping, extract specific data using BeautifulSoup, wait for JavaScript rendering using Selenium, and save everything in a new JSON or CSV file.</p>
<p>But first, I should warn you about the legality of web scraping. While the act of scraping is legal, the data you may extract can be illegal to use. Make sure that you're not messing with any:</p>
<ul>
<li>Copyrighted content – since it's someone's intellectual property, it's protected by law and you can't just reuse it.</li>
<li>Personal data – if the information you gather can be used to identify a person, then it's considered personal data and for EU citizens, it's protected under the GDPR. Unless you have a lawful reason to store that data, it's better to just skip it altogether.</li>
</ul>
<p>Generally speaking, you should always read a website's terms and conditions before scraping to make sure that you're not going against their policies. If you're ever unsure how to proceed, contact the site owner and ask for consent. </p>
<h2 id="heading-what-will-you-need-for-your-scraper">What Will You Need for Your Scraper?</h2>
<p>To start building your own web scraper, you will first need to have <a target="_blank" href="https://www.python.org/downloads/">Python</a> installed on your machine. Ubuntu 20.04 and other versions of Linux come with Python 3 pre-installed. </p>
<p>To check if you already have Python installed on your device, run the following command:</p>
<pre><code>python3 --version
</code></pre><p>If you have Python installed, you should receive an output like this:</p>
<pre><code>Python <span class="hljs-number">3.8</span><span class="hljs-number">.2</span>
</code></pre><p>For our web scraper, we will also use the Python packages Requests (for downloading pages), BeautifulSoup (for selecting specific data), and Selenium (for rendering dynamically loaded content). To install them, just run these commands:</p>
<pre><code>pip3 install requests beautifulsoup4
</code></pre><p>and</p>
<pre><code>pip3 install selenium
</code></pre><p>The final step is to make sure you <a target="_blank" href="https://support.google.com/chrome/answer/95346?co=GENIE.Platform%3DDesktop&amp;hl=en">install Google Chrome</a> and <a target="_blank" href="https://chromedriver.chromium.org/downloads">Chrome Driver</a> on your machine. These will be necessary if we want to use Selenium to scrape dynamically loaded content.</p>
<h2 id="heading-how-to-inspect-the-page">How to Inspect the Page</h2>
<p>Now that you have everything installed, it’s time to start our scraping project in earnest. </p>
<p>You should choose the website you want to scrape based on your needs. Keep in mind that each website structures its content differently, so you’ll need to adjust what you learn here when you start scraping on your own. Each website will require minor changes to the code.</p>
<p>For this article, I decided to scrape information about the first ten movies from the top 250 movies list from IMDb: <a target="_blank" href="https://www.imdb.com/chart/top/">https://www.imdb.com/chart/top/</a>. </p>
<p>First, we will get the titles, then we will dive in further by extracting information from each movie’s page. Some of the data will require JavaScript rendering.</p>
<p>To start understanding the content’s structure, you should right-click on the first title from the list and then choose “Inspect Element”.</p>
<p><img src="https://lh4.googleusercontent.com/e6DE3zczzQa-VSBIynK-fR4oyAjVbpx2PztpEDKbi3K0NII9_lFkFhGQmiOjc_-Y_Kg26cM3pecnSKNiPlLZGpntqVKUrcX9E4gDWaTsolWoCFzQ6EEhj3GruBvrlEIzrUffvdjU" alt="Image" width="600" height="400" loading="lazy"></p>
<p>By pressing CTRL+F and searching in the HTML code structure, you will see that there is only one <strong>&lt;table&gt;</strong> tag on the page. This is useful as it tells us how we can access the data.</p>
<p>A CSS selector that will give us all of the titles from the page is <strong><code>table tbody tr td.titleColumn a</code></strong>. That’s because all titles are in an anchor inside a table cell with the class “titleColumn”. </p>
<p>Using this CSS selector and getting the <strong>innerText</strong> of each anchor will give us the titles that we need. You can try that in the browser console of the developer tools panel you just opened, using this JavaScript line:</p>
<pre><code><span class="hljs-built_in">document</span>.querySelectorAll(<span class="hljs-string">"table tbody tr td.titleColumn a"</span>)[<span class="hljs-number">0</span>].innerText
</code></pre><p>You will see something like this:</p>
<p><img src="https://lh4.googleusercontent.com/T1pgLUXJHX_s3gubDKvBjwkWeK1neZxiysoneD2Q1NU3Sj_pD8defdKorTlcsiiqShlmPDEeCu3Goo5T9CgzPKCml9dq_kCCu7KUyTx7uSrU8VN9QzJZhO6AwBM-kfQ8r0uNxbn9" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now that we have this selector, we can start writing our Python code and extracting the information we need.</p>
<h2 id="heading-how-to-use-beautifulsoup-to-extract-statically-loaded-content">How to Use BeautifulSoup to Extract Statically Loaded Content</h2>
<p>The movie titles from our list are static content. That’s because if you look into the page source (CTRL+U on the page or right-click and then choose View Page Source), you will see that the titles are already there.</p>
<p>Static content is usually easier to scrape as it doesn’t require JavaScript rendering. To extract the first ten titles on the list, we will use BeautifulSoup to get the content and then print it in the output of our scraper.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup

page = requests.get(<span class="hljs-string">'https://www.imdb.com/chart/top/'</span>) <span class="hljs-comment"># Getting page HTML through request</span>
soup = BeautifulSoup(page.content, <span class="hljs-string">'html.parser'</span>) <span class="hljs-comment"># Parsing content using beautifulsoup</span>

links = soup.select(<span class="hljs-string">"table tbody tr td.titleColumn a"</span>) <span class="hljs-comment"># Selecting all of the anchors with titles</span>
first10 = links[:<span class="hljs-number">10</span>] <span class="hljs-comment"># Keep only the first 10 anchors</span>
<span class="hljs-keyword">for</span> anchor <span class="hljs-keyword">in</span> first10:
    print(anchor.text) <span class="hljs-comment"># Display the innerText of each anchor</span>
</code></pre>
<p>The code above uses the selector we saw in the first step to extract the movie title anchors from the page. It then loops through the first ten and displays the innerText of each.</p>
<p>The output should look like this:</p>
<p><img src="https://lh3.googleusercontent.com/RrmEldjCrbz7V1-o4r6UsKNuWkj_yD2cWwfyuMMbdnRn7lk9cI0yhMi85PK4NrvX7L2KY0pY8047f9CmAeXo1W51HvFENMPxxh36ACqu3kNKuoFNNfhB_WSCMntIB-UB0usEU2n5" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-extract-dynamically-loaded-content">How to Extract Dynamically Loaded Content</h2>
<p>As technology advanced, websites started to load their content dynamically. This improves the page’s performance and the user's experience, but it also adds an extra barrier for scrapers.</p>
<p>The barrier is that the HTML retrieved from a simple request will not contain the dynamic content. Fortunately, with Selenium, we can load the page in a real browser and wait for the dynamic content to be displayed.</p>
<h3 id="heading-how-to-use-selenium-for-requests">How to Use Selenium for Requests</h3>
<p>You will need to know the location of your chromedriver. The following code is identical to the one presented in the second step, but this time we are using Selenium to make the request. We will still parse the page’s content using BeautifulSoup, as we did before.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver

option = webdriver.ChromeOptions()
<span class="hljs-comment"># I use the following options because my machine runs Windows Subsystem for Linux (WSL).</span>
<span class="hljs-comment"># Of the three, I recommend using at least the headless option</span>
option.add_argument(<span class="hljs-string">'--headless'</span>)
option.add_argument(<span class="hljs-string">'--no-sandbox'</span>)
option.add_argument(<span class="hljs-string">'--disable-dev-shm-usage'</span>)
<span class="hljs-comment"># Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location</span>
driver = webdriver.Chrome(<span class="hljs-string">'YOUR-PATH-TO-CHROMEDRIVER'</span>, options=option)

driver.get(<span class="hljs-string">'https://www.imdb.com/chart/top/'</span>) <span class="hljs-comment"># Getting page HTML through request</span>
soup = BeautifulSoup(driver.page_source, <span class="hljs-string">'html.parser'</span>) <span class="hljs-comment"># Parsing content using beautifulsoup. Notice driver.page_source instead of page.content</span>

links = soup.select(<span class="hljs-string">"table tbody tr td.titleColumn a"</span>) <span class="hljs-comment"># Selecting all of the anchors with titles</span>
first10 = links[:<span class="hljs-number">10</span>] <span class="hljs-comment"># Keep only the first 10 anchors</span>
<span class="hljs-keyword">for</span> anchor <span class="hljs-keyword">in</span> first10:
    print(anchor.text) <span class="hljs-comment"># Display the innerText of each anchor</span>
</code></pre>
<p>Don’t forget to replace “YOUR-PATH-TO-CHROMEDRIVER” with the location where you extracted the chromedriver. Also, you should notice that instead of <strong><code>page.content</code></strong>, when we are creating the BeautifulSoup object, we are now using <strong><code>driver.page_source</code></strong>, which provides the HTML content of the page.</p>
<h3 id="heading-how-to-extract-statically-loaded-content-using-selenium">How to Extract Statically Loaded Content Using Selenium</h3>
<p>Using the code from above, we can now access each movie page by calling the click method on each of the anchors.</p>
<pre><code class="lang-python">first_link = driver.find_elements_by_css_selector(<span class="hljs-string">'table tbody tr td.titleColumn a'</span>)[<span class="hljs-number">0</span>]
first_link.click()
</code></pre>
<p>This will simulate a click on the first movie’s link. However, in this case, I recommend that you continue using <strong><code>driver.get</code></strong> instead. This is because you can no longer use the <strong><code>click()</code></strong> method once you are on a different page, since the new page doesn't have links to the other nine movies.</p>
<p>As a result, after clicking on the first title from the list, you’d need to go back to the first page, then click on the second, and so on. This is a waste of performance and time. Instead, we will just use the extracted links and access them one by one.</p>
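<p>Since we will visit the extracted links directly, it helps to turn each relative <code>href</code> into a full URL first. The snippet below is a minimal sketch using <code>urljoin</code> from Python's standard library. The helper name <code>build_movie_urls</code> and the sample hrefs are assumptions made for illustration, not part of the scraper above.</p>

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com/"  # assumed base, taken from the chart URL used above


def build_movie_urls(hrefs, base=BASE_URL):
    """Turn relative hrefs into absolute URLs without doubling slashes."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical hrefs, shaped like the ones the title anchors carry
sample_hrefs = ["/title/tt0111161/", "/title/tt0068646/"]
movie_urls = build_movie_urls(sample_hrefs)
for url in movie_urls:
    print(url)
```

<p>Plain string concatenation usually works too, but <code>urljoin</code> copes with hrefs both with and without a leading slash.</p>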
<p>For “The Shawshank Redemption”, the movie page will be <a target="_blank" href="https://www.imdb.com/title/tt0111161/">https://www.imdb.com/title/tt0111161/</a>. We will extract the movie’s year and duration from the page, but this time we will use Selenium’s functions instead of BeautifulSoup as an example. In practice, you can use either one, so pick your favorite.</p>
<p>To retrieve the movie’s year and duration, you should repeat the first step we went through here on the movie’s page. </p>
<p>You will notice that you can find all of the information in the first element with the class <strong><code>ipc-inline-list</code></strong> (the <code>.ipc-inline-list</code> selector) and that all of the elements of the list contain the attribute <strong><code>role</code></strong> with the value <strong><code>presentation</code></strong> (the <code>[role='presentation']</code> selector).</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver

option = webdriver.ChromeOptions()
<span class="hljs-comment"># I use the following options because my machine runs Windows Subsystem for Linux (WSL).</span>
<span class="hljs-comment"># Of the three, I recommend using at least the headless option</span>
option.add_argument(<span class="hljs-string">'--headless'</span>)
option.add_argument(<span class="hljs-string">'--no-sandbox'</span>)
option.add_argument(<span class="hljs-string">'--disable-dev-shm-usage'</span>)
<span class="hljs-comment"># Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location</span>
driver = webdriver.Chrome(<span class="hljs-string">'YOUR-PATH-TO-CHROMEDRIVER'</span>, options=option)

page = driver.get(<span class="hljs-string">'https://www.imdb.com/chart/top/'</span>) <span class="hljs-comment"># Getting page HTML through request</span>
soup = BeautifulSoup(driver.page_source, <span class="hljs-string">'html.parser'</span>) <span class="hljs-comment"># Parsing content using beautifulsoup</span>

totalScrapedInfo = [] <span class="hljs-comment"># In this list we will save all the information we scrape</span>
links = soup.select(<span class="hljs-string">"table tbody tr td.titleColumn a"</span>) <span class="hljs-comment"># Selecting all of the anchors with titles</span>
first10 = links[:<span class="hljs-number">10</span>] <span class="hljs-comment"># Keep only the first 10 anchors</span>
<span class="hljs-keyword">for</span> anchor <span class="hljs-keyword">in</span> first10:
    driver.get(<span class="hljs-string">'https://www.imdb.com/'</span> + anchor[<span class="hljs-string">'href'</span>]) <span class="hljs-comment"># Access the movie’s page</span>
    infolist = driver.find_elements_by_css_selector(<span class="hljs-string">'.ipc-inline-list'</span>)[<span class="hljs-number">0</span>] <span class="hljs-comment"># Find the first element with class ‘ipc-inline-list’</span>
    informations = infolist.find_elements_by_css_selector(<span class="hljs-string">"[role='presentation']"</span>) <span class="hljs-comment"># Find all elements with role=’presentation’ from the first element with class ‘ipc-inline-list’</span>
    scrapedInfo = {
        <span class="hljs-string">"title"</span>: anchor.text,
        <span class="hljs-string">"year"</span>: informations[<span class="hljs-number">0</span>].text,
        <span class="hljs-string">"duration"</span>: informations[<span class="hljs-number">2</span>].text,
    } <span class="hljs-comment"># Save all the scraped information in a dictionary</span>
    totalScrapedInfo.append(scrapedInfo) <span class="hljs-comment"># Append the dictionary to the totalScrapedInfo list</span>

print(totalScrapedInfo) <span class="hljs-comment"># Display the list with all the information we scraped</span>
</code></pre>
<h3 id="heading-how-to-extract-dynamically-loaded-content-using-selenium">How to Extract Dynamically Loaded Content Using Selenium</h3>
<p>The next big step in web scraping is extracting content that is loaded dynamically. You can find such content on each of the movie’s pages (such as <a target="_blank" href="https://www.imdb.com/title/tt0111161/">https://www.imdb.com/title/tt0111161/</a>) in the Editorial Lists section. </p>
<p>If you look using inspect on the page, you'll see that you can find the section as an element with the attribute <strong><code>data-testid</code></strong> set as <strong><code>firstListCardGroup-editorial</code></strong>. But if you look in the page source, you will not find this attribute value anywhere. That’s because the Editorial Lists section is loaded by IMDB dynamically.</p>
<p>In the following example, we will scrape the editorial list of each movie and add it to our current results of the total scraped information. </p>
<p>To do that, we will import a few more packages that make it possible to wait for our dynamic content to load.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver
<span class="hljs-keyword">from</span> selenium.webdriver.common.by <span class="hljs-keyword">import</span> By
<span class="hljs-keyword">from</span> selenium.webdriver.support.ui <span class="hljs-keyword">import</span> WebDriverWait
<span class="hljs-keyword">from</span> selenium.webdriver.support <span class="hljs-keyword">import</span> expected_conditions <span class="hljs-keyword">as</span> EC

option = webdriver.ChromeOptions()
<span class="hljs-comment"># I use the following options because my machine runs Windows Subsystem for Linux (WSL).</span>
<span class="hljs-comment"># Of the three, I recommend using at least the headless option</span>
option.add_argument(<span class="hljs-string">'--headless'</span>)
option.add_argument(<span class="hljs-string">'--no-sandbox'</span>)
option.add_argument(<span class="hljs-string">'--disable-dev-shm-usage'</span>)
<span class="hljs-comment"># Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location</span>
driver = webdriver.Chrome(<span class="hljs-string">'YOUR-PATH-TO-CHROMEDRIVER'</span>, options=option)

page = driver.get(<span class="hljs-string">'https://www.imdb.com/chart/top/'</span>) <span class="hljs-comment"># Getting page HTML through request</span>
soup = BeautifulSoup(driver.page_source, <span class="hljs-string">'html.parser'</span>) <span class="hljs-comment"># Parsing content using beautifulsoup</span>

totalScrapedInfo = [] <span class="hljs-comment"># In this list we will save all the information we scrape</span>
links = soup.select(<span class="hljs-string">"table tbody tr td.titleColumn a"</span>) <span class="hljs-comment"># Selecting all of the anchors with titles</span>
first10 = links[:<span class="hljs-number">10</span>] <span class="hljs-comment"># Keep only the first 10 anchors</span>
<span class="hljs-keyword">for</span> anchor <span class="hljs-keyword">in</span> first10:
    driver.get(<span class="hljs-string">'https://www.imdb.com/'</span> + anchor[<span class="hljs-string">'href'</span>]) <span class="hljs-comment"># Access the movie’s page </span>
    infolist = driver.find_elements_by_css_selector(<span class="hljs-string">'.ipc-inline-list'</span>)[<span class="hljs-number">0</span>] <span class="hljs-comment"># Find the first element with class ‘ipc-inline-list’</span>
    informations = infolist.find_elements_by_css_selector(<span class="hljs-string">"[role='presentation']"</span>) <span class="hljs-comment"># Find all elements with role=’presentation’ from the first element with class ‘ipc-inline-list’</span>
    scrapedInfo = {
        <span class="hljs-string">"title"</span>: anchor.text,
        <span class="hljs-string">"year"</span>: informations[<span class="hljs-number">0</span>].text,
        <span class="hljs-string">"duration"</span>: informations[<span class="hljs-number">2</span>].text,
    } <span class="hljs-comment"># Save all the scraped information in a dictionary</span>
    WebDriverWait(driver, <span class="hljs-number">5</span>).until(EC.visibility_of_element_located((By.CSS_SELECTOR, <span class="hljs-string">"[data-testid='firstListCardGroup-editorial']"</span>)))  <span class="hljs-comment"># We are waiting for 5 seconds for our element with the attribute data-testid set as `firstListCardGroup-editorial`</span>
    listElements = driver.find_elements_by_css_selector(<span class="hljs-string">"[data-testid='firstListCardGroup-editorial'] .listName"</span>) <span class="hljs-comment"># Extracting the editorial lists elements</span>
    listNames = [] <span class="hljs-comment"># Creating an empty list and then appending only the elements texts</span>
    <span class="hljs-keyword">for</span> el <span class="hljs-keyword">in</span> listElements:
        listNames.append(el.text)
    scrapedInfo[<span class="hljs-string">'editorial-list'</span>] = listNames <span class="hljs-comment"># Adding the editorial list names to our scrapedInfo dictionary</span>
    totalScrapedInfo.append(scrapedInfo) <span class="hljs-comment"># Append the dictionary to the totalScrapedInfo list</span>

print(totalScrapedInfo) <span class="hljs-comment"># Display the list with all the information we scraped</span>
</code></pre>
<p>For the previous example, you should get the following output:</p>
<p><img src="https://lh4.googleusercontent.com/geHhbKeeP2ATtz-OnIx9MATB3UvXcrobnO4eUNOLrzQll9ebPlq_2PqKaT_oT6e-3h7NmRkRh_9mrDuSvuW3Wbs3sRi1iuM3paCa8HBpTqWrZuSQc8sIu5y4EVZ_5j-60TmPs71Z" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-save-the-scraped-content">How to Save the Scraped Content</h2>
<p>Now that we have all the data we want, we can save it as a .json or a .csv file for easier readability. </p>
<p>To do that, we will use Python's built-in <code>json</code> and <code>csv</code> modules and write our content to new files:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> csv
<span class="hljs-keyword">import</span> json

...

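<span class="hljs-comment"># Write the list of dictionaries to a JSON file; the with block closes the file for us</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'movies.json'</span>, mode=<span class="hljs-string">'w'</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> file:
    json.dump(totalScrapedInfo, file)

<span class="hljs-comment"># newline='' stops the csv module from writing blank rows on Windows</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'movies.csv'</span>, mode=<span class="hljs-string">'w'</span>, newline=<span class="hljs-string">''</span>, encoding=<span class="hljs-string">'utf-8'</span>) <span class="hljs-keyword">as</span> csv_file:
    writer = csv.writer(csv_file)
    <span class="hljs-keyword">for</span> movie <span class="hljs-keyword">in</span> totalScrapedInfo:
        writer.writerow(movie.values())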
</code></pre>
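<p>If you want the CSV file to carry a header row, <code>csv.DictWriter</code> is one option. The sketch below writes to an in-memory buffer so it runs without touching the disk. The sample rows and the helper name <code>movies_to_csv</code> are hypothetical stand-ins for the dictionaries stored in <code>totalScrapedInfo</code>.</p>

```python
import csv
import io

# Hypothetical sample shaped like the dictionaries collected in totalScrapedInfo
sample_movies = [
    {"title": "The Shawshank Redemption", "year": "1994", "duration": "2h 22m"},
    {"title": "The Godfather", "year": "1972", "duration": "2h 55m"},
]


def movies_to_csv(movies, fieldnames=("title", "year", "duration")):
    """Serialize a list of movie dictionaries to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" skips keys that are not listed in fieldnames
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(movies)
    return buffer.getvalue()


csv_text = movies_to_csv(sample_movies)
print(csv_text)
```

<p>To write to disk instead, you could pass a file opened with <code>newline=''</code> to <code>csv.DictWriter</code> in place of the buffer.</p>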
<h2 id="heading-scraping-tips-and-tricks">Scraping Tips and Tricks</h2>
<p>While our guide so far is already advanced enough to take care of JavaScript rendering scenarios, there are still many things to explore in Selenium. </p>
<p>In this section, I will share some tips and tricks that may come in handy.</p>
<h3 id="heading-1-time-your-requests">1. Time your requests</h3>
<p>If you spam a server with hundreds of requests in a short time, it’s very likely that at some point a CAPTCHA will appear, or your IP might even get blocked. Unfortunately, there is no workaround in Python to avoid that. </p>
<p>Therefore, you should pause between requests so that the traffic looks more natural.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> requests

page = requests.get(<span class="hljs-string">'https://www.imdb.com/chart/top/'</span>) <span class="hljs-comment"># Getting page HTML through request</span>
time.sleep(<span class="hljs-number">30</span>) <span class="hljs-comment"># Wait 30 seconds</span>
page = requests.get(<span class="hljs-string">'https://www.imdb.com/'</span>) <span class="hljs-comment"># Getting page HTML through request</span>
</code></pre>
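<p>A fixed 30-second wait is easy to reason about, but evenly spaced requests can still look automated. One possible variation, sketched below, is to sleep for a random duration between requests. The helper name <code>polite_pause</code> and the delay ranges are assumptions chosen for illustration, and the loop only prints instead of making real requests.</p>

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=6.0):
    """Sleep for a random duration in the given range and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Hypothetical list of pages to visit; replace the print with your real request
urls = ["https://www.imdb.com/chart/top/", "https://www.imdb.com/"]
for url in urls:
    print("Would request", url)
    waited = polite_pause(0.01, 0.05)  # tiny range so the demo finishes quickly
```

<p>In a real scraper you would use a range of several seconds and tune it to what the site tolerates.</p>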
<h3 id="heading-2-error-handling">2. Error handling</h3>
<p>Since websites are dynamic and they can change structure at any moment, error handling might come in handy if you use the same web scraper frequently.</p>
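<pre><code class="lang-python"><span class="hljs-keyword">from</span> selenium.common.exceptions <span class="hljs-keyword">import</span> TimeoutException

<span class="hljs-keyword">for</span> attempt <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>): <span class="hljs-comment"># Try up to 3 times</span>
    <span class="hljs-keyword">try</span>:
        WebDriverWait(driver, <span class="hljs-number">5</span>).until(EC.presence_of_element_located((By.CSS_SELECTOR, <span class="hljs-string">"your selector"</span>)))
        <span class="hljs-keyword">break</span> <span class="hljs-comment"># The element loaded, so stop retrying</span>
    <span class="hljs-keyword">except</span> TimeoutException:
        <span class="hljs-comment"># If the loading took too long, print message and try again</span>
        print(<span class="hljs-string">"Loading took too much time!"</span>)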
</code></pre>
<p>The <code>try</code> and <code>except</code> syntax can be useful when you’re waiting for an element, extracting it, or even when you’re just making the request.</p>
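<p>The same retry idea can be wrapped in a small reusable helper. The sketch below uses only the standard library, so it runs without a browser. The <code>retry</code> function and the deliberately flaky <code>flaky_action</code> are hypothetical names made up for this example, and the built-in <code>TimeoutError</code> stands in for Selenium's <code>TimeoutException</code>.</p>

```python
import time


def retry(action, attempts=3, delay_seconds=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            if attempt == attempts:
                raise  # Out of attempts, so let the caller see the error
            print(f"Attempt {attempt} failed ({error}); retrying")
            time.sleep(delay_seconds)


# Hypothetical flaky action that fails twice before succeeding
calls = {"count": 0}


def flaky_action():
    calls["count"] += 1
    if calls["count"] == 3:
        return "loaded"
    raise TimeoutError("page took too long")


result = retry(flaky_action, attempts=3, delay_seconds=0.01, exceptions=(TimeoutError,))
print(result)
```

<p>With Selenium, you could pass a function that performs the <code>WebDriverWait</code> call and list <code>TimeoutException</code> in <code>exceptions</code>.</p>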
<h3 id="heading-3-take-screenshots">3. Take Screenshots</h3>
<p>If you need to obtain a screenshot of the web page you are scraping at any moment, you can use:</p>
<pre><code class="lang-python">driver.save_screenshot(<span class="hljs-string">'screenshot-file-name.png'</span>)
</code></pre>
<p>This can help debug when you’re working with dynamically loaded content.</p>
<h3 id="heading-4-read-the-documentation">4. Read the documentation</h3>
<p>Last but not least, don’t forget to read the <a target="_blank" href="https://selenium-python.readthedocs.io/">documentation from Selenium</a>. It explains how to perform most of the actions you can do in a browser. </p>
<p>Using Selenium, you can fill out forms, press buttons, answer popup messages, and do many other cool things. </p>
<p>If you’re facing a new problem, their documentation can be your best friend.</p>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>This article’s purpose is to give you an advanced introduction to web scraping using Python with Selenium and BeautifulSoup. While there are still many features from both technologies to explore, you now have a solid base on how to start scraping.</p>
<p>Sometimes web scraping can be very difficult, as websites start to put more and more obstacles in the developer’s way. Some of these obstacles can be Captcha codes, IP blocks, or dynamic content. Overcoming them just with Python and Selenium might be difficult or even impossible. </p>
<p>So, I’ll give you an alternative as well. Try using a <a target="_blank" href="https://webscrapingapi.com">web scraping API</a> that solves all those challenges for you. It also uses rotating proxies so that you don’t have to worry about adding timeouts between requests. Just remember to always check if the data you want can be lawfully extracted and used.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Scrape Websites with Node.js and Cheerio ]]>
                </title>
                <description>
                    <![CDATA[ There might be times when a website has data you want to analyze but the site doesn't expose an API for accessing those data. To get the data, you'll have to resort to web scraping. In this article, I'll go over how to scrape websites with Node.js an... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-scrape-websites-with-node-js-and-cheerio/</link>
                <guid isPermaLink="false">66d45f6da326133d12440a09</guid>
                
                    <category>
                        <![CDATA[ node ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Node.js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Joseph Mawa ]]>
                </dc:creator>
                <pubDate>Mon, 19 Jul 2021 16:50:30 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/07/scraping-1.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>There might be times when a website has data you want to analyze but the site doesn't expose an API for accessing those data.</p>
<p>To get the data, you'll have to resort to <a target="_blank" href="https://en.wikipedia.org/wiki/Web_scraping">web scraping</a>.</p>
<p>In this article, I'll go over how to scrape websites with <a target="_blank" href="https://nodejs.dev/">Node.js</a> and <a target="_blank" href="https://cheerio.js.org/">Cheerio</a>.</p>
<p>Before we start, you should be aware that there are some <a target="_blank" href="https://monashdatafluency.github.io/python-web-scraping/section-5-legal-and-ethical-considerations/">legal and ethical issues</a> you should consider before scraping a site. It's your responsibility to make sure that it's okay to scrape a site before doing so.</p>
<p>The sites used in the examples throughout this article all allow scraping, so feel free to follow along.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Here are some things you'll need for this tutorial:</p>
<ul>
<li><p>You need to have <a target="_blank" href="https://nodejs.dev">Node.js</a> installed. If you don't have Node, just make sure you download it for your system from the <a target="_blank" href="https://nodejs.dev/download/">Node.js downloads page</a></p>
</li>
<li><p>You need to have a text editor like <a target="_blank" href="https://code.visualstudio.com/">VSCode</a> or <a target="_blank" href="https://atom.io/">Atom</a> installed on your machine</p>
</li>
<li><p>You should have at least a basic understanding of JavaScript, Node.js, and the Document Object Model (DOM). But you can still follow along even if you are a total beginner with these technologies. Feel free to ask questions on the <a target="_blank" href="https://forum.freecodecamp.org/">freeCodeCamp forum</a> if you get stuck</p>
</li>
</ul>
<h2 id="heading-what-is-web-scraping">What is Web Scraping?</h2>
<blockquote>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Web_scraping">Web scraping</a> is the process of extracting data from a web page. Though you can do web scraping manually, the term usually refers to automated data extraction from websites - <a target="_blank" href="https://en.wikipedia.org/wiki/Web_scraping">Wikipedia</a>.</p>
</blockquote>
<h2 id="heading-what-is-cheerio">What is Cheerio?</h2>
<p>Cheerio is a tool for parsing HTML and XML in Node.js, and is very popular with over <a target="_blank" href="https://github.com/cheeriojs/cheerio">23k stars</a> on GitHub.</p>
<p>It is fast, flexible, and easy to use. Since it implements a subset of jQuery, it's easy to start using Cheerio if you're already familiar with jQuery.</p>
<p>According to the <a target="_blank" href="https://cheerio.js.org/">documentation</a>, Cheerio parses markup and provides an API for manipulating the resulting data structure but does not interpret the result like a web browser.</p>
<blockquote>
<p>The major difference between cheerio and a web browser is that cheerio does not produce visual rendering, load CSS, load external resources or execute JavaScript. It simply parses markup and provides an API for manipulating the resulting data structure. That explains why it is also very fast - <a target="_blank" href="https://cheerio.js.org/">cheerio documentation</a>.</p>
</blockquote>
<p>If you want to use cheerio for scraping a web page, you need to first fetch the markup using packages like <a target="_blank" href="https://axios-http.com/docs/intro">axios</a> or <a target="_blank" href="https://www.npmjs.com/package/node-fetch">node-fetch</a> among others.</p>
<h2 id="heading-how-to-scrape-a-web-page-in-node-using-cheerio">How to Scrape a Web Page in Node Using Cheerio</h2>
<p>In this section, you will learn how to scrape a web page using cheerio. It is important to point out that before scraping a website, make sure you have permission to do so – or you might find yourself violating terms of service, breaching copyright, or violating privacy.</p>
<p>In this example, we will scrape the <a target="_blank" href="https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#:~:text=ISO%203166%2D1%20alpha%2D3%20codes%20are%20three%2Dletter,special%20areas%20of%20geographical%20interest.">ISO 3166-1 alpha-3 codes</a> for all countries and other jurisdictions as listed on <a target="_blank" href="https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3">this Wikipedia page</a>. It is under the <strong>Current codes</strong> section of the <a target="_blank" href="https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3">ISO 3166-1 alpha-3</a> page.</p>
<p>This is what the list of countries/jurisdictions and their corresponding codes looks like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/007-05-list-of-countries.png" alt="list of countries and corresponding codes" width="600" height="400" loading="lazy"></p>
<p>You can follow the steps below to scrape the data in the above list.</p>
<h3 id="heading-step-1-create-a-working-directory">Step 1 - Create a Working Directory</h3>
<p>In this step, you will create a directory for your project by running the command below on the terminal. The command will create a directory called <code>learn-cheerio</code>. You can give it a different name if you wish.</p>
<pre><code class="lang-sh">mkdir learn-cheerio
</code></pre>
<p>You should be able to see a folder named <code>learn-cheerio</code> created after successfully running the above command.</p>
<p>In the next step, you will open the directory you have just created in your favorite text editor and initialize the project.</p>
<h3 id="heading-step-2-initialize-the-project">Step 2 - Initialize the Project</h3>
<p>In this step, you will navigate to your project directory and initialize the project. Open the directory you created in the previous step in your favorite text editor and initialize the project by running the command below.</p>
<pre><code class="lang-js">npm init -y
</code></pre>
<p>Successfully running the above command will create a <code>package.json</code> file at the root of your project directory.</p>
<p>In the next step, you will install project dependencies.</p>
<h3 id="heading-step-3-install-dependencies">Step 3 - Install Dependencies</h3>
<p>In this step, you will install project dependencies by running the command below. This will take a couple of minutes, so just be patient.</p>
<pre><code class="lang-js">npm i axios cheerio pretty
</code></pre>
<p>Successfully running the above command will register three dependencies in the <code>package.json</code> file under the <code>dependencies</code> field. The first dependency is <code>axios</code>, the second is <code>cheerio</code>, and the third is <code>pretty</code>.</p>
<p><a target="_blank" href="https://axios-http.com/docs/intro">axios</a> is a very popular <a target="_blank" href="https://stackoverflow.com/questions/49950973/difference-between-http-client-and-rest-client">http client</a> which works in node and in the browser. We need it because cheerio is a markup parser.</p>
<p>For cheerio to parse the markup and scrape the data you need, we need to use <code>axios</code> for fetching the markup from the website. You can use another HTTP client to fetch the markup if you wish. It doesn't necessarily have to be <code>axios</code>.</p>
<p><a target="_blank" href="https://www.npmjs.com/package/pretty">pretty</a> is an npm package for beautifying the markup so that it is readable when printed on the terminal.</p>
<p>In the next section, you will inspect the markup you will scrape data from.</p>
<h3 id="heading-step-4-inspect-the-web-page-you-want-to-scrape">Step 4 - Inspect the Web Page You Want to Scrape</h3>
<p>Before you scrape data from a web page, it is very important to understand the HTML structure of the page.</p>
<p>In this step, you will inspect the HTML structure of the web page you are going to scrape data from.</p>
<p>Navigate to the <a target="_blank" href="https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3">ISO 3166-1 alpha-3 codes</a> page on Wikipedia. Under the "Current codes" section, there is a list of countries and their corresponding codes. You can open DevTools by pressing the key combination <code>CTRL + SHIFT + I</code> in Chrome, or by right-clicking and then selecting the "Inspect" option.</p>
<p>This is what the list looks like for me in Chrome DevTools:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/007-04-dev-tool.png" alt="List in chrome devtools" width="600" height="400" loading="lazy"></p>
<p>In the next section, you will write code for scraping the web page.</p>
<h3 id="heading-step-5-write-the-code-to-scrape-the-data">Step 5 - Write the Code to Scrape the Data</h3>
<p>In this section, you will write code for scraping the data we are interested in. Start by running the command below which will create the <code>app.js</code> file.</p>
<pre><code class="lang-js">touch app.js
</code></pre>
<p>Successfully running the above command will create an <code>app.js</code> file at the root of the project directory.</p>
<p>Like any other Node package, you must first <em>require</em> <code>axios</code>, <code>cheerio</code>, and <code>pretty</code> before you start using them. You can do so by adding the code below at the top of the <code>app.js</code> file you have just created.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> axios = <span class="hljs-built_in">require</span>(<span class="hljs-string">"axios"</span>);
<span class="hljs-keyword">const</span> cheerio = <span class="hljs-built_in">require</span>(<span class="hljs-string">"cheerio"</span>);
<span class="hljs-keyword">const</span> pretty = <span class="hljs-built_in">require</span>(<span class="hljs-string">"pretty"</span>);
</code></pre>
<p>Before we write code for scraping our data, we need to learn the basics of <code>cheerio</code>. We'll parse the markup below and try manipulating the resulting data structure. This will help us learn cheerio syntax and its most common methods.</p>
<p>The markup below is the <code>ul</code> element containing our <code>li</code> elements.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> markup = <span class="hljs-string">`
&lt;ul class="fruits"&gt;
  &lt;li class="fruits__mango"&gt; Mango &lt;/li&gt;
  &lt;li class="fruits__apple"&gt; Apple &lt;/li&gt;
&lt;/ul&gt;
`</span>;
</code></pre>
<p>Add the above variable declaration to the <code>app.js</code> file.</p>
<h2 id="heading-how-to-load-markup-in-cheerio">How to Load Markup in Cheerio</h2>
<p>You can load markup in <code>cheerio</code> using the <code>cheerio.load</code> method. The method takes the markup as an argument. It also takes two more optional arguments. You can read more about them <a target="_blank" href="https://cheerio.js.org/">in the documentation</a> if you are interested.</p>
<p>Below, we are passing the first and only required argument and storing the returned value in the <code>$</code> variable. We are using the <code>$</code> variable because of cheerio's similarity to <a target="_blank" href="https://jquery.com/">jQuery</a>. You can use a different variable name if you wish.</p>
<p>Add the code below to your <code>app.js</code> file:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> $ = cheerio.load(markup);
<span class="hljs-built_in">console</span>.log(pretty($.html()));
</code></pre>
<p>If you now execute the code in your <code>app.js</code> file by running the command <code>node app.js</code> on the terminal, you should be able to see the markup on the terminal. This is what I see on my terminal:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/007-01-cheerio-html.png" alt="Markup terminal output" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-select-an-element-in-cheerio">How to Select an Element in Cheerio</h2>
<p>Cheerio supports most of the common CSS selectors, such as the <code>class</code>, <code>id</code>, and <code>element</code> selectors. In the code below, we select the element with the class <code>fruits__mango</code> and then log its inner HTML to the console. Add the code below to your <code>app.js</code> file.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> mango = $(<span class="hljs-string">".fruits__mango"</span>);
<span class="hljs-built_in">console</span>.log(mango.html()); <span class="hljs-comment">// Mango</span>
</code></pre>
<p>The above lines of code will log the text <code>Mango</code> on the terminal if you execute <code>app.js</code> using the command <code>node app.js</code>.</p>
<h2 id="heading-how-to-get-the-attribute-of-an-element-in-cheerio">How to Get the Attribute of an Element in Cheerio</h2>
<p>You can also select an element and get a specific attribute such as the <code>class</code>, <code>id</code>, or all the attributes and their corresponding values.</p>
<p>Add the code below to your <code>app.js</code> file:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> apple = $(<span class="hljs-string">".fruits__apple"</span>);
<span class="hljs-built_in">console</span>.log(apple.attr(<span class="hljs-string">"class"</span>)); <span class="hljs-comment">//fruits__apple</span>
</code></pre>
<p>The above code will log <code>fruits__apple</code> on the terminal. <code>fruits__apple</code> is the class of the selected element.</p>
<h2 id="heading-how-to-loop-through-a-list-of-elements-in-cheerio">How to Loop Through a List of Elements in Cheerio</h2>
<p>Cheerio provides the <code>.each</code> method for looping through several selected elements.</p>
<p>Below, we are selecting all the <code>li</code> elements and looping through them using the <code>.each</code> method. We log the text content of each list item on the terminal.</p>
<p>Add the code below to your <code>app.js</code> file.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> listItems = $(<span class="hljs-string">"li"</span>);
<span class="hljs-built_in">console</span>.log(listItems.length); <span class="hljs-comment">// 2</span>
listItems.each(<span class="hljs-function"><span class="hljs-keyword">function</span> (<span class="hljs-params">idx, el</span>) </span>{
  <span class="hljs-built_in">console</span>.log($(el).text());
});
<span class="hljs-comment">// Mango</span>
<span class="hljs-comment">// Apple</span>
</code></pre>
<p>The above code will log <code>2</code>, which is the length of the list items, and the text <code>Mango</code> and <code>Apple</code> on the terminal after executing the code in <code>app.js</code>.</p>
<h2 id="heading-how-to-append-or-prepend-an-element-to-a-markup-in-cheerio">How to Append or Prepend an Element to a Markup in Cheerio</h2>
<p>Cheerio provides the <code>append</code> and <code>prepend</code> methods for adding an element to the loaded markup.</p>
<p>The <code>append</code> method will add the element passed as an argument after the last child of the selected element. On the other hand, <code>prepend</code> will add the passed element before the first child of the selected element.</p>
<p>Add the code below to your <code>app.js</code> file:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> ul = $(<span class="hljs-string">"ul"</span>);
ul.append(<span class="hljs-string">"&lt;li&gt;Banana&lt;/li&gt;"</span>);
ul.prepend(<span class="hljs-string">"&lt;li&gt;Pineapple&lt;/li&gt;"</span>);
<span class="hljs-built_in">console</span>.log(pretty($.html()));
</code></pre>
<p>After appending and prepending elements to the markup, this is what I see when I log <code>$.html()</code> on the terminal:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/007-02-append-prepend.png" alt="Append or prepend terminal output" width="600" height="400" loading="lazy"></p>
<p>Those are the basics of cheerio that can get you started with web scraping.</p>
<p>To scrape the data we described at the beginning of this article from Wikipedia, copy and paste the code below in the <code>app.js</code> file:</p>
<pre><code class="lang-js"><span class="hljs-comment">// Loading the dependencies. We don't need pretty</span>
<span class="hljs-comment">// because we shall not log html to the terminal</span>
<span class="hljs-keyword">const</span> axios = <span class="hljs-built_in">require</span>(<span class="hljs-string">"axios"</span>);
<span class="hljs-keyword">const</span> cheerio = <span class="hljs-built_in">require</span>(<span class="hljs-string">"cheerio"</span>);
<span class="hljs-keyword">const</span> fs = <span class="hljs-built_in">require</span>(<span class="hljs-string">"fs"</span>);

<span class="hljs-comment">// URL of the page we want to scrape</span>
<span class="hljs-keyword">const</span> url = <span class="hljs-string">"https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3"</span>;

<span class="hljs-comment">// Async function which scrapes the data</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">scrapeData</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-comment">// Fetch HTML of the page we want to scrape</span>
    <span class="hljs-keyword">const</span> { data } = <span class="hljs-keyword">await</span> axios.get(url);
    <span class="hljs-comment">// Load HTML we fetched in the previous line</span>
    <span class="hljs-keyword">const</span> $ = cheerio.load(data);
    <span class="hljs-comment">// Select all the list items in plainlist class</span>
    <span class="hljs-keyword">const</span> listItems = $(<span class="hljs-string">".plainlist ul li"</span>);
    <span class="hljs-comment">// Stores data for all countries</span>
    <span class="hljs-keyword">const</span> countries = [];
    <span class="hljs-comment">// Use .each method to loop through the li we selected</span>
    listItems.each(<span class="hljs-function">(<span class="hljs-params">idx, el</span>) =&gt;</span> {
      <span class="hljs-comment">// Object holding data for each country/jurisdiction</span>
      <span class="hljs-keyword">const</span> country = { <span class="hljs-attr">name</span>: <span class="hljs-string">""</span>, <span class="hljs-attr">iso3</span>: <span class="hljs-string">""</span> };
      <span class="hljs-comment">// Select the text content of a and span elements</span>
      <span class="hljs-comment">// Store the textcontent in the above object</span>
      country.name = $(el).children(<span class="hljs-string">"a"</span>).text();
      country.iso3 = $(el).children(<span class="hljs-string">"span"</span>).text();
      <span class="hljs-comment">// Populate countries array with country data</span>
      countries.push(country);
    });
    <span class="hljs-comment">// Logs countries array to the console</span>
    <span class="hljs-built_in">console</span>.dir(countries);
    <span class="hljs-comment">// Write countries array in countries.json file</span>
    fs.writeFile(<span class="hljs-string">"countries.json"</span>, <span class="hljs-built_in">JSON</span>.stringify(countries, <span class="hljs-literal">null</span>, <span class="hljs-number">2</span>), <span class="hljs-function">(<span class="hljs-params">err</span>) =&gt;</span> {
      <span class="hljs-keyword">if</span> (err) {
        <span class="hljs-built_in">console</span>.error(err);
        <span class="hljs-keyword">return</span>;
      }
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">"Successfully written data to file"</span>);
    });
  } <span class="hljs-keyword">catch</span> (err) {
    <span class="hljs-built_in">console</span>.error(err);
  }
}
<span class="hljs-comment">// Invoke the above function</span>
scrapeData();
</code></pre>
<p>Do you understand what is happening by reading the code? If not, I'll go into some detail now. I have also made comments on each line of code to help you understand.</p>
<p>In the above code, we <strong>require</strong> all the dependencies at the top of the <code>app.js</code> file and then declare the <code>scrapeData</code> function. Inside the function, we fetch the markup using <code>axios</code> and then load the fetched HTML into <code>cheerio</code>.</p>
<p>The list of countries/jurisdictions and their corresponding <code>iso3</code> codes is nested in a <code>div</code> element with a class of <code>plainlist</code>. We select the <code>li</code> elements and loop through them using the <code>.each</code> method. For each list item, we read the country name from its <code>a</code> child and the code from its <code>span</code> child, then push the resulting object to the <code>countries</code> array.</p>
<p>After running the code above using the command <code>node app.js</code>, the scraped data is written to the <code>countries.json</code> file and printed on the terminal. This is part of what I see on my terminal:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/007-03-terminal-output.png" alt="Terminal output" width="600" height="400" loading="lazy"></p>
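<p>Scraped text often carries stray whitespace, and a selector can match list items that lack the children you expect. The sketch below shows one possible way to tidy the rows before saving them, using the promise-based <code>fs/promises</code> API instead of a callback. The sample rows and the <code>cleanRows</code> and <code>saveCountries</code> helpers are hypothetical, and the file is written to the temporary directory so that it does not clash with your own output.</p>

```javascript
const fs = require("fs/promises");
const os = require("os");
const path = require("path");

// Hypothetical rows, shaped like the objects pushed inside the .each loop
const rawRows = [
  { name: " Afghanistan ", iso3: "AFG" },
  { name: "", iso3: "" }, // e.g. a list item without the expected children
  { name: "Albania", iso3: " ALB" },
];

// Trim whitespace and drop rows that are missing a name or a code
function cleanRows(rows) {
  return rows
    .map(({ name, iso3 }) => ({ name: name.trim(), iso3: iso3.trim() }))
    .filter(({ name, iso3 }) => name !== "" && iso3 !== "");
}

// Write the rows as JSON, then read the file back and return the record count
async function saveCountries(filePath, rows) {
  await fs.writeFile(filePath, JSON.stringify(rows, null, 2));
  const saved = JSON.parse(await fs.readFile(filePath, "utf8"));
  return saved.length;
}

const target = path.join(os.tmpdir(), "countries.json");
saveCountries(target, cleanRows(rawRows))
  .then((count) => console.log(`Saved ${count} records to ${target}`))
  .catch((err) => console.error(err));
```

<p>Because <code>saveCountries</code> returns a promise, you could also <code>await</code> it inside an <code>async</code> function such as <code>scrapeData</code>, so that a single <code>try...catch</code> handles both the request and the file write.</p>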
<h2 id="heading-conclusion">Conclusion</h2>
<p>Thank you for reading this article and reaching the end! We have covered the basics of web scraping using <code>cheerio</code>. You can head over to the <a target="_blank" href="https://cheerio.js.org/">cheerio documentation</a> if you want to dive deeper and fully understand how it works.</p>
<p>Feel free to ask questions on the <a target="_blank" href="https://forum.freecodecamp.org/">freeCodeCamp forum</a> if there is anything you don't understand in this article.</p>
<p>Finally, remember to consider the <a target="_blank" href="https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01">ethical concerns</a> as you learn web scraping.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
