Scenario	Traditional Scraping	AI Scraping
Stable websites	✅ Best choice	✅ Works, but can be overkill
Frequently changing layouts	❌ Breaks often	✅ More resilient
Large-scale crawling	✅ More cost-efficient	✅ Efficient but can get expensive
Fast prototyping	❌ Slower setup	✅ Very fast
Non-technical users	❌ Requires coding	✅ More accessible
Full control & transparency	✅ High control	❌ Less transparent
Messy or inconsistent data	❌ Hard to maintain	✅ Easier to handle
Complex workflows (login, steps)	⚠️ Possible but manual	✅ Often built-in

web scraping - freeCodeCamp.org

How to Run an AI Extractability Audit on Your Site (I Found 6 Heading Tags That Cost Me Citations)

Chudi Nnorukam — Wed, 22 Jul 2026 23:16:07 +0000

When an AI assistant answers a question, it lifts sentences from a handful of pages and cites them. Whether your page is liftable is not a mystery or a vibe. It's a set of mechanical properties of your HTML that you can measure, score, and fix.

This tutorial walks through the exact audit I ran on my own site, the six invisible heading tags it caught, the one-commit fix, and the CI gate that keeps the problem from coming back.

Here is the punchline up front: my homepage scored 65 out of 100 on extractability. The cause was five UI card components that rendered their titles as `<h2>` and `<h3>` tags. Demoting those six headings to ARIA-preserving paragraphs, without changing a single visible pixel or removing one word of content, took the page to 100.

Over the last 90 days, Microsoft's Bing Webmaster Tools reports 1,600 AI citations across 33 of my pages. Extraction is the stage of that pipeline this tutorial teaches you to audit.

Table of Contents

What an Extractability Audit Actually Tests

Prerequisites

Step 1: Pick the Pages Worth Auditing

Step 2: Run the Five Checks

Step 3: Read Your Failure Classes

Step 4: Find the Components Emitting Fake Headings

Step 5: Demote the Headings Without Breaking Accessibility

Step 6: Gate the Fix in CI

What Actually Moved

What I Rejected, and Why

FAQ

What You Accomplished

What an Extractability Audit Actually Tests

A citation from an AI engine is the last step of a three-stage machine pipeline, and your page has to pass every stage:

Retrieve: the engine's crawler is allowed to fetch your page, and does.
Extract: the model finds a clean, self-contained answer in your markup.
Attribute: the engine is confident enough about who said it to put your name next to it.

Most AI-visibility advice concentrates on stage 1 (robots.txt, sitemaps, llms.txt) and stage 3 (schema, entity signals). Stage 2 is where I've found the cheapest wins, because it's pure HTML engineering, and because it fails silently: a page that retrieves fine and attributes fine but extracts poorly simply never appears in answers, and nothing tells you why.

Extractability is the measurable version of stage 2: can a parser walking your rendered HTML find self-contained answer blocks under clearly scoped headings? The audit in this tutorial scores that on a 0 to 100 scale using five checks, each of which you can verify by hand:

Check	What it tests	Weight
F1	The first sentence under every H2 stands alone as an answer	30
F2	The first 200 tokens of the page contain a direct answer	20
F3	Each H2 section opens with an answer in the 40 to 60 word band	20
F4	Share of H2/H3 headings phrased as questions a user would type	20
F5	An FAQ section exists at the article footer	10

A score of 75 or above lands in the EXTRACTABLE band. 40 to 74 is PARTIALLY-EXTRACTABLE. Below 40 is NOT-EXTRACTABLE. The bands come from the AI Visibility Readiness framework I maintain, but the five checks themselves are engine-agnostic: they encode how retrieval-augmented systems chunk pages by heading, embed the chunks, and lift the opening sentences of whichever chunk matches the query.
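To make the weighting concrete, here is a minimal sketch of how a composite and its band could be computed. The weights and band cut-offs come from the table and paragraph above. The linear partial-credit rule (each check contributes its weight times a pass fraction between 0 and 1) is an assumption for illustration, as are the names `WEIGHTS`, `composite`, and `band`; the production auditor may score differently.

```python
# Weights from the table above; they sum to 100.
WEIGHTS = {"F1": 30, "F2": 20, "F3": 20, "F4": 20, "F5": 10}


def composite(pass_fractions):
    """Combine per-check pass fractions (0.0 to 1.0) into a 0-100 score.

    Assumption: linear partial credit per check. Missing checks count as 0.
    """
    total = 0.0
    for check, weight in WEIGHTS.items():
        fraction = min(max(pass_fractions.get(check, 0.0), 0.0), 1.0)
        total += weight * fraction
    return round(total)


def band(score):
    """Map a 0-100 score to the band names used in this tutorial."""
    if score >= 75:
        return "EXTRACTABLE"
    if score >= 40:
        return "PARTIALLY-EXTRACTABLE"
    return "NOT-EXTRACTABLE"


# Hypothetical page: F3 half passing, F4 a quarter passing, the rest clean.
example = {"F1": 1.0, "F2": 1.0, "F3": 0.5, "F4": 0.25, "F5": 1.0}
score = composite(example)
print(score, band(score))  # 75 EXTRACTABLE
```

Under this assumed rule, a page can lose all of F5 and still pass, but cannot lose all of F1.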

The critical detail for this tutorial: the audit counts every `<h1>`, `<h2>`, and `<h3>` in your rendered DOM. Not the headings you wrote in your CMS. The headings your component library emits. That gap is where my six invisible failures lived.

Prerequisites

A live website you can measure and deploy (Any stack. My examples are SvelteKit, and every fix translates to React, Vue, or plain HTML.)

Python 3.10+ with `requests` and `beautifulsoup4` (`pip install requests beautifulsoup4`)

Access to your search console data (Google Search Console or Bing Webmaster Tools) to pick pages

A CI system (the example uses GitHub Actions)

About 90 minutes: 20 for the audit, 40 for the fix, 30 for the CI gate

Step 1: Pick the Pages Worth Auditing

Don't audit your whole sitemap. Audit the pages that already have distribution, because extraction fixes multiply whatever retrieval you already earn.

Open Google Search Console, go to Performance, sort pages by impressions over the last 28 days, and look at where your distribution actually lives.

Here's the top of my own report from that export (July 21):

Page	Impressions (28d)	Clicks	Avg position
/blog/claude-fable-5-vs-opus-4-8	17,315	462	6.2
/blog/how-i-built-polymarket-trading-bot	13,649	104	7.6
/blog/claude-code-production-trading-bot	6,540	94	8.5
/blog/aeo-answer-engine-optimization-explained	4,189	1	8.2

Individual posts dominate the impressions, but notice what every one of those posts has in common: they're all rendered by the same layout and card components.

Fixing a component fixes every page that uses it at once, which is why I scoped the audit to the top 3 to 5 content-index pages instead of individual posts: the homepage, your blog index, your topic or category hubs.

Index pages are assembled almost entirely from repeating cards, so they show component damage in its most concentrated form, and any fix propagates to everything else.

I chose these three:

`chudi.dev/` (the homepage)

`chudi.dev/blog` (the writing index)

`chudi.dev/topics` (the topic hub)

Artifact check: you should now have a written list of 3 to 5 URLs. That list is the audit's scope.


Step 2: Run the Five Checks

You can score the five checks with about 60 lines of Python. This is a deliberately minimal version of the auditor I run in production. It implements the two checks that catch component damage (F3 and F4) plus a full heading census, which is enough to find the class of bug this tutorial fixes.

import re
import sys
import requests
from bs4 import BeautifulSoup

QUESTION = re.compile(
    r"^\s*(what|how|why|when|where|who|which|is|are|can|do|does|should|will|did)\b|\?\s*$",
    re.IGNORECASE,
)

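def audit(url):
    # Fetch the HTML the server returns (content injected only by client-side JS will not appear here)
    resp = requests.get(url, timeout=8, headers={"User-Agent": "extract-audit/1.0"})
    resp.raise_for_status()  # fail fast so an error page is never audited by mistake
    soup = BeautifulSoup(resp.text, "html.parser")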

    headings = [(h.name, " ".join(h.get_text().split())) for h in soup.find_all(["h1", "h2", "h3"])]
    subheads = [(n, t) for n, t in headings if n in ("h2", "h3")]

    question_rate = (
        sum(1 for _, t in subheads if QUESTION.search(t)) / len(subheads) if subheads else 0.0
    )

    in_band = 0
    h2s = soup.find_all("h2")
    for h2 in h2s:
        first_p = h2.find_next("p")
        words = len(first_p.get_text().split()) if first_p else 0
        if 40 <= words <= 60:
            in_band += 1

    print(f"URL: {url}")
    print(f"Heading census ({len(headings)} total):")
    for name, text in headings:
        print(f"  <{name}> {text[:70]}")
    print(f"F4 question-format rate: {question_rate:.1%} (target >= 50%)")
    print(f"F3 sections opening in the 40-60 word band: {in_band}/{len(h2s)}")

if __name__ == "__main__":
    audit(sys.argv[1])

Save the script as `extract_audit.py`, then run it against each page on your list:

python3 extract_audit.py https://yoursite.com/

The heading census is the part to stare at. It prints every H1/H2/H3 a parser sees, in order, which is frequently not the outline you think you published.
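The F4 classification is a regex heuristic, so it helps to see what it accepts. This small sketch reuses the same pattern as the script above on a few hypothetical headings (the sample strings and the `question_rate` helper are illustrative, not part of the original auditor). Note the last sample: a statement that starts with a question word still counts as a question.

```python
import re

# Same pattern as the QUESTION regex in the audit script above.
QUESTION = re.compile(
    r"^\s*(what|how|why|when|where|who|which|is|are|can|do|does|should|will|did)\b|\?\s*$",
    re.IGNORECASE,
)


def question_rate(headings):
    """Share of headings the pattern classifies as question-format."""
    if not headings:
        return 0.0
    return sum(1 for text in headings if QUESTION.search(text)) / len(headings)


samples = [
    "How do I see it run live?",  # starts with a question word
    "Ready to ship?",             # ends with a question mark
    "Featured posts",             # statement: no match
    "Who We Are",                 # statement, but matches on "who"
]
for text in samples:
    print(f"{bool(QUESTION.search(text))!s:5}  {text}")
print(f"rate: {question_rate(samples):.0%}")  # rate: 75%
```

Because of false positives like the last one, treat the rate as a proxy and read the census itself before trusting the number.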

If you want the full five-check scored version with the weighted 0 to 100 composite, the automated audit on citability.dev runs all five checks plus retrieval and attribution layers. The manual version above is enough to complete this tutorial.

Artifact check: a terminal output per page showing the heading census, the F4 rate, and the F3 band count. Screenshot it. It is your before-state.

Step 3: Read Your Failure Classes

Here's what the audit said about my homepage before the fix, pulled from the commit record of the remediation (2026-05-23):

Score: 65/100, PARTIALLY-EXTRACTABLE, ten points under the threshold
F4 question-format rate: 26.7%, far below the 50% pass line
Cause: more than ten headings in the census that I never wrote as headings

The census made the cause obvious. Alongside the section headings I had deliberately tuned ("How do I see it run live?", "What is the retrieval header?") sat a pile of statements like blog post titles and project names, each wrapped in `<h3>`. I hadn't typed a single one of them into a heading field. My card components had.

This is the general lesson, and it is worth stating as a rule:

The denominator is the design problem. Every heading your components emit joins the denominator of every ratio check an extraction parser runs. Ten card titles as H3s means your carefully tuned question headings are outvoted 10 to 4 by markup you never see.
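The arithmetic behind that rule is short enough to check by hand. The sketch below assumes a split of 4 authored question headings and 11 component-emitted statements; that split is an assumption chosen because it reproduces the reported 26.7%, not a figure taken from the audit output.

```python
def question_rate(question_headings, other_headings):
    """F4-style ratio: question headings over all H2/H3 headings."""
    total = question_headings + other_headings
    return question_headings / total if total else 0.0


# Assumed split: 4 authored question headings plus 11 component-emitted
# statements. This is one split consistent with the reported 26.7%.
before = question_rate(4, 11)

# After demotion, the card titles leave the census entirely.
after = question_rate(4, 0)

print(f"before: {before:.1%}")  # before: 26.7%
print(f"after:  {after:.1%}")   # after:  100.0%
```

Nothing about the four authored headings changed between the two lines. Only the denominator did.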

Failure classes map to fixes like this:

Symptom in the census	Failure class	Fix (Step)
Headings you never wrote, repeated in card-sized clusters	Component-emitted headings	Steps 4 and 5
Your own H2s are statements, not questions	Authored heading style	Rephrase to question form
Sections open with a 15-word teaser or a 120-word ramble	Answer-band miss	Densify openers to 40 to 60 words
No FAQ block	Missing F5 surface	Add one at the footer

I had all four classes across my three pages. The component class cost the most points of any single class, and it's the one nobody catches by reading their CMS, so it gets the deep treatment here. (For the record, the authored fixes on my other pages were exactly what the table says: two H2s on my framework page rephrased into question form, and a topic-hub opener expanded from 37 words to roughly 50 to enter the answer band.)

Artifact check: your census annotated with the four failure classes. Count how many headings you didn't author.


Step 4: Find the Components Emitting Fake Headings

The census tells you fake headings exist. Your component library tells you where they come from. Grep for heading tags inside your component directory, not your content:

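grep -rn "<h[1-6]" path/to/components --include="*.svelte"

(React: grep -rn "<h[1-6]" path/to/components --include="*.tsx". Vue: same idea with .vue. Replace path/to/components with your own component directory.)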

On my site, this surfaced six heading sites across five components:



Component	Emitted	Instances
`BlogCard.svelte`	post title	2
`BlogCardFeatured.svelte`	post title	1
`ProductCard.svelte`	product name	1
`ProjectCard.svelte`	project name	1
`JourneyCard.svelte`	milestone title	1


Six tags don't sound like much until you remember that cards repeat. One blog index rendering ten BlogCard instances injects ten `<h3>` statements into that page's census. Every card-built page on the site inherits the same dilution, which is exactly why my content-index pages scored worst.
Why do component libraries do this? Because a card title looks like a heading, and because accessibility guidance rightly encourages semantic HTML.
The mistake is subtler: a card title is a link label into another document, not a section heading of this document. The page's real outline is "here are my featured posts", not the title of each post teased below it. HTML has no tag for "title of a different page", so components default to H2/H3, and every parser that walks the page inherits a false outline.
Artifact check: a table like the one above: component, tag emitted, instance count. This is your fix list.
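If you would rather generate that fix list than assemble it by hand, a short script can walk the component directory and count heading tags per file. This is a minimal sketch: the function names, the list of file suffixes, and any directory you pass in are assumptions, and a regex over template source will also count headings inside comments or strings.

```python
import re
from collections import Counter
from pathlib import Path

# Matches opening <h1> through <h6> tags in template source.
HEADING_TAG = re.compile(r"<(h[1-6])\b", re.IGNORECASE)


def heading_sites(component_dir, suffixes=(".svelte", ".tsx", ".jsx", ".vue")):
    """Return {file name: Counter of heading tags} for component files."""
    report = {}
    for path in sorted(Path(component_dir).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        source = path.read_text(encoding="utf-8")
        tags = Counter(m.group(1).lower() for m in HEADING_TAG.finditer(source))
        if tags:
            report[path.name] = tags
    return report


def print_fix_list(report):
    """Print the component / tag / instance-count table as tab-separated rows."""
    print("Component\tEmitted\tInstances")
    for name, tags in report.items():
        for tag, count in sorted(tags.items()):
            print(f"{name}\t<{tag}>\t{count}")
```

Call it as `print_fix_list(heading_sites("path/to/components"))`, substituting your own component directory.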
Step 5: Demote the Headings Without Breaking Accessibility
The obvious fix, swapping `<h3>` for a plain styled element with no heading semantics, has a real cost: screen reader users navigate by heading structure, and card titles are genuinely useful landmarks when scanning a list of posts. Deleting the semantics entirely trades an AI-extraction win for an accessibility loss. That trade isn't necessary.
The fix that preserves both is ARIA heading demotion: replace the literal tag with a paragraph carrying role="heading" and an explicit aria-level.
One important clarification before the diff: the first rule of ARIA is to prefer native HTML elements, and this fix doesn't violate it. The rule applies when the text genuinely is a heading of the current document, and the whole point of Step 4 was establishing that card titles are not. They are link labels into other documents.
Native `<h3>` was the wrong semantics, while the ARIA role is a courtesy that keeps the list-scanning navigation screen reader users already rely on.
Here's the diff from my BlogCard.svelte, with the class list elided as `...`:
-<h3 class="...">
+<p role="heading" aria-level="3" class="...">
   {post.title}
-</h3>
+</p>

What changes and what does not:

Assistive technology sees the same outline. role="heading" plus aria-level="3" is the ARIA-standard equivalent of an `<h3>`. Screen readers that navigate by heading still stop here and still announce the level.

Visual styling is untouched. Every class stays on the element. Zero pixels move.

Content is untouched. The fix removes zero words. This matters because most extraction advice tells you to rewrite. But this class of bug needs no rewriting.

HTML-tag parsers stop counting it. Extraction pipelines chunk by literal h1/h2/h3 elements. The card title exits the census, your authored headings get the denominator back, and the ratios you tuned start passing.
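The last point can be checked without deploying anything. The sketch below models a parser that counts only literal heading tags, using Python's standard-library `html.parser`; the card markup and class names are hypothetical. A pipeline that also honors ARIA roles would still see the demoted title, which is why the second counter is tracked separately.

```python
from html.parser import HTMLParser


class HeadingCensus(HTMLParser):
    """Counts literal <h1>/<h2>/<h3> tags, the way a tag-based chunker would."""

    def __init__(self):
        super().__init__()
        self.literal = 0  # real h1-h3 elements
        self.aria = 0     # other elements exposing role="heading"

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.literal += 1
        elif dict(attrs).get("role") == "heading":
            self.aria += 1


def census(html):
    """Return (literal heading count, ARIA-only heading count)."""
    parser = HeadingCensus()
    parser.feed(html)
    return parser.literal, parser.aria


# Hypothetical card markup; class names are placeholders.
before = '<a href="/blog/x"><h3 class="card-title">Post title</h3></a>'
after = '<a href="/blog/x"><p role="heading" aria-level="3" class="card-title">Post title</p></a>'

print(census(before))  # (1, 0)
print(census(after))   # (0, 1)
```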


Apply the same one-line change at every site on your Step 4 fix list. Mine was one commit touching five components, six occurrences.
Then redeploy and re-run the Step 2 audit. My homepage went from 65 to 100/100 EXTRACTABLE on the post-deploy re-score, with the question-format rate recovering from 26.7% to above the 50% threshold, because the four question headings I had authored were finally the only H2/H3 population on the page.
Artifact check: the after-audit terminal output next to your before screenshot. The heading census should now contain only headings you wrote on purpose.
Step 6: Gate the Fix in CI
Here's the uncomfortable truth about extraction scores: they drift. Content changes, components get added, or a redesign ships a new card.
My homepage, re-audited live while writing this tutorial (July 21), sits at 80: still EXTRACTABLE, but down from its post-fix 100, because a homepage redesign in the intervening weeks changed the section structure again. The blog index and topic hub both still score 100.
That drift is why the durable deliverable of this tutorial isn't the fix. It's the regression gate. Without one, the next well-meaning component ships a new `<h3>` and your score quietly decays. Nothing visible breaks, so nothing gets caught in review.
Mine runs as a GitHub Actions workflow triggered by every successful production deployment, and hard-fails if any audited URL drops out of the EXTRACTABLE band:
name: Post-Deploy Extractability Audit

on:
  deployment_status:

jobs:
  audit:
    if: |
      github.event.deployment_status.state == 'success' &&
      github.event.deployment.environment == 'Production'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - run: pip install requests beautifulsoup4
      - name: Audit extractability on the live URLs
        run: |
          for url in "https://yoursite.com/" "https://yoursite.com/blog"; do
            python3 scripts/extract_audit.py "$url" --min-score 75 || exit 1
          done

To make the minimal auditor CI-ready, add a --min-score flag that exits nonzero below the threshold. That's a five-line change to the Step 2 script (compute the weighted score from the checks you implement, compare, sys.exit(1)).
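One way that change could look, as a sketch rather than the production gate: `parse_args`, `gate_exit_code`, and the `score_fn` parameter are hypothetical names, and `score_fn` stands in for whatever computes your weighted score (the Step 2 `audit()` would need to return a number instead of only printing).

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Extractability audit gate")
    parser.add_argument("url", help="page to audit")
    parser.add_argument("--min-score", type=int, default=None,
                        help="exit nonzero if the score is below this value")
    return parser.parse_args(argv)


def gate_exit_code(score, min_score):
    """Return the process exit code: 0 to pass, 1 to fail the CI job."""
    if min_score is not None and score < min_score:
        return 1
    return 0


def main(argv=None, score_fn=None):
    args = parse_args(argv)
    # score_fn is a placeholder for your own scoring function.
    score = score_fn(args.url) if score_fn else 0
    print(f"score={score} min={args.min_score}")
    return gate_exit_code(score, args.min_score)
```

In the script itself, finish with `sys.exit(main(score_fn=...))` so the shell loop in the workflow sees the nonzero exit code and stops.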
The production version of my gate audits five URLs and stacks Lighthouse accessibility thresholds into the same workflow, so the ARIA-demotion contract from Step 5 is enforced from both directions: extraction can't regress below 75, and accessibility can't regress below 95. That pairing is the whole point. The two constraints keep each other honest.
Artifact check: a CI run in your Actions tab that fails when you feed it --min-score 101 (proving it can fail) and passes at 75.
What Actually Moved


The scoreboard for my three pages, all numbers from the same instrument:

Page	Before fix (May)	After fix	Live re-audit (July 21)
Homepage	65 PARTIALLY-EXTRACTABLE	100 EXTRACTABLE	80 EXTRACTABLE
Blog index	below threshold	100 EXTRACTABLE	100 EXTRACTABLE
Topic hub	below threshold	100 EXTRACTABLE	100 EXTRACTABLE




And the downstream metric the audit exists to serve: Bing Webmaster Tools' AI Performance report shows my site earning 1,600 AI citations across 33 pages in the 90 days ending July 19, from Microsoft Copilot and partner assistants. That number was 671 in late April, around when this remediation arc started, and roughly 1,500 by late June. (The report is the only first-party AI citation dashboard that currently exists; you'll find it in your BWT property under Search Performance.)
A note on causality, because this is where AI-visibility content usually oversells: the citation growth is correlated with the extraction work, not cleanly attributed to it. Over the same window, I also shipped content, fixed retrieval issues, and grew regular search traffic.
What I can defend: the audit scores are fully causal (the same instrument, before and after, moved because of one commit), the mechanism is documented engine behavior (heading-based chunking), and the citations kept compounding after the fix. What I can't give you is a controlled experiment isolating six heading tags. Nobody really can.
What I Rejected, and Why
Selection bias is the failure mode of tutorials like this one, so here's what I considered and didn't do:

Rewriting the page copy: This is standard extraction advice. But I rejected it because the census showed a structural problem, not a prose problem. My authored sections already passed. Rewriting would have burned days and muddied the measurement.

Plain-element demotion without ARIA: Two fewer attributes per element. I rejected this because it deletes real navigation structure for screen reader users. The audit wouldn't have noticed the difference, but people would've.

Stuffing FAQ schema on every page: F5 is worth 10 points and JSON-LD is cheap. I rejected this as the first move because it treats the symptom with metadata while leaving the false outline in place. Schema asserts what your page means but the DOM is what gets chunked. Fix the DOM first.

Auditing every page on the sitemap: Completeness is seductive. I rejected this because extraction fixes multiply retrieval, and most pages have little retrieval to multiply. Three index pages covered the highest-impression surfaces and every card component in one pass.

Chasing a 100 score as a standing target: After watching my homepage drift from 100 to 80 through an unrelated redesign while staying comfortably in the EXTRACTABLE band, I set the CI gate at the 75 threshold, not at 100. Gating at perfection turns every content experiment into a CI failure and teaches your team to ignore the gate.


FAQ
Does demoting headings hurt my regular SEO?
The headings that matter for search are the ones describing your document's own structure, and those stay untouched. What you're removing is markup that claimed other documents' titles as your outline.
My organic search impressions grew over the months following the fix. Nothing in Google's guidance requires card titles to be heading elements.
Is this just gaming one audit script?
The five checks encode how retrieval-augmented systems actually process pages: chunk by heading, embed chunks, and lift opening sentences of matching chunks. A false outline degrades that pipeline no matter whose script measures it. You're not optimizing for my auditor. Instead, you're fixing the DOM that every parser sees. The score is a proxy, which is exactly why Step 6 gates the band, not the number.
I use React or Vue, not Svelte. Does anything change?
Nothing structural. The bug lives in JSX and SFC templates identically (`<h3>{title}</h3>` inside a Card.tsx), the grep in Step 4 finds it, and role="heading" with aria-level works in every framework because it's plain HTML.
What about the headings inside my actual articles?
Leave them as real `<h2>`/`<h3>` elements. Article body headings are your document's structure and they're precisely what should be in the census. The demotion pattern applies only to components that surface other pages' titles: cards, teasers, related-post widgets, and navigation panels.
How often should I re-audit?
Continuously, which is what Step 6 buys you: the CI gate re-audits on every production deployment, so you never re-audit by hand again.
If you skip the gate, run the Step 2 script monthly and after any change to layout components, navigation, or templates. Content edits inside a page rarely move the score much. Component and template changes are what reshape the census, and those are exactly the changes nobody thinks to re-measure. My own 100 to 80 homepage drift came from a redesign, not from writing.
My score is low but I have no card components. Now what?
Then your failure class is authored, not structural: statement headings (rephrase into questions users type), openers outside the 40 to 60 word band (densify), or a missing FAQ block (add one). The census from Step 2 tells you which. The fixes are writing work rather than component work.
What You Accomplished
You measured a property of your site most owners have never seen: the heading census your components actually emit, and the extractability score it produces.
You traced low scores to the specific components responsible, applied a demotion pattern that satisfies extraction parsers and screen readers simultaneously, and wired a CI gate so the score can never silently regress again.
The wider context, from the first two guides in this series: measuring your AI citation rate across engines tells you whether you're being cited, and shipping an agent-facing surface with WebMCP prepares your site for agents that act rather than read.
This tutorial closes the loop in the middle: making the content you already have liftable. Retrieval determines whether engines see you, attribution determines whether they name you, and extraction, the stage you just audited, determines whether there's anything clean enough to quote.
Run the census on your top three pages this week. If your components are voting in your outline, you now know how to take the vote back.




 Web Scraping for Beginners 2026 
Beau Carnes — Wed, 10 Jun 2026 02:16:49 +0000
 If you have ever wanted to collect product data, monitor competitors, track SEO rankings, or build AI tools that pull information from the internet, you have likely run into the common frustrations of web scraping: broken scripts, rate limits, bot detection, and tedious CAPTCHAs.
We just published a new tutorial on the freeCodeCamp.org YouTube channel, featuring software developer and course creator Ania Kubow.
In this comprehensive, beginner-friendly course, Ania teaches you a much simpler, more efficient approach. Instead of building scrapers from scratch, you will learn how to leverage an API to handle the heavy lifting for you.
Throughout this tutorial, you will master the following:

How to bypass web scraping obstacles like bot protection and rate limits using SerpApi, the Web Search API.

How to extract structured JSON data directly from search engines like Google, Amazon, YouTube, and more.

How to use the Google Lens API to scrape images and visual matches.

How to build your own functional web application that searches for and downloads content locally to your computer.


By the end of this video, you will have the knowledge and the basic code necessary to turn internet data into actionable insights for your own projects.
Watch the full tutorial on the freeCodeCamp.org YouTube channel (1-hour watch).

 


 Traditional Scraping vs AI Scraping: A Practical Guide for Developers and Data Teams 
Joel Olawanle — Thu, 16 Apr 2026 21:37:47 +0000
 Enormous amounts of data are constantly generated on the open web. Product prices change, job listings go live and get taken down, news articles are published, and company information gets updated.
For developers and teams that rely on this kind of data, the question has never been whether to scrape the web, but how to do so reliably over time.
For a long time, the approach has been straightforward. You inspect a page, write selectors, and extract the data using tools like BeautifulSoup or browser automation libraries like Playwright and Selenium. This works well, but it comes with a familiar problem: the moment the structure of a page changes, your scraper breaks and needs fixing.
Recently, a different approach has started gaining attention. Instead of writing selectors, you describe what you want and let the system figure out how to extract it. This is what people refer to as AI scraping.
Both approaches are widely used today, but they solve the problem in very different ways. This guide breaks down how each one works, where each one fits, and how to decide which approach makes sense for your use case.
Table of Contents

What is Traditional Web Scraping?

Traditional Scraping in Practice

What is AI Web Scraping?

AI Scraping in Practice

Traditional Scraping vs AI Scraping: When to Use Each


What is Traditional Web Scraping?
Traditional web scraping is built on a simple idea: if a browser can load a page and display data to a user, a program should be able to do the same and extract that data automatically.
Extraction is usually done with CSS selectors or XPath. A CSS selector like `.product-card .price` means “find the price element inside a product card.” It's easy to understand and works well for most use cases.
XPath, on the other hand, is more powerful but more complex. It allows you to navigate the structure of a page in more detail, including moving up and down the DOM, filtering by text, or handling deeply nested elements.
In practice, most developers start with CSS selectors and only use XPath when the structure becomes too complex.
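To illustrate what XPath adds, here is a small sketch using Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. The snippet and its class names are hypothetical; on real, messy HTML you would typically reach for a lenient parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup. ElementTree is used here only because
# it ships with Python; it implements a limited XPath subset.
doc = ET.fromstring("""
<div>
  <div class="product-card"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="product-card"><h2>Gadget</h2><span class="price">19.50</span></div>
</div>
""")

# Structural lookup, close to the CSS selector `.product-card > .price`.
all_prices = [
    el.text
    for el in doc.findall(".//div[@class='product-card']/span[@class='price']")
]

# Filter by text: the price of the card whose <h2> reads "Gadget".
gadget_price = doc.findtext(
    ".//div[@class='product-card'][h2='Gadget']/span[@class='price']"
)

# Move up the tree: from each price element back to its parent card's title.
titles = [card.findtext("h2") for card in doc.findall(".//span[@class='price']/..")]

print(all_prices)    # ['9.99', '19.50']
print(gadget_price)  # 19.50
print(titles)        # ['Widget', 'Gadget']
```

The second and third queries, filtering on text and stepping to a parent, are the kind of navigation a plain CSS selector cannot express.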
This idea has been around since the early days of the web. Instead of manually copying information from a page, developers started writing scripts that send requests, receive HTML responses, and extract the pieces they care about.
At its core, nothing about that model has really changed.
You still fetch a page, inspect its structure, and extract data from it. The difference today is not the concept, but how sophisticated the tooling and scale have become.
The Tools Behind Traditional Scraping
Over time, a solid ecosystem of tools has developed around this approach.

Requests is the de facto Python library for making HTTP calls. Most traditional scrapers use requests to fetch pages and then pass the response to BeautifulSoup for parsing. It's simple and reliable for static sites.

BeautifulSoup is a Python library for parsing HTML and XML. It takes raw HTML and builds a navigable tree of objects from it. It's fast to learn, very readable, and excellent for static pages. Its main limitation is that it has no browser engine, so it can't execute JavaScript. If a site renders content dynamically after page load, BeautifulSoup will see an empty container.

Selenium and Playwright are browser automation tools that control a real browser. They can click buttons, scroll, and wait for JavaScript to finish loading before extracting data. The trade-off is that they are slower and more resource-intensive than simple HTTP requests, but they are necessary for dynamic sites.
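The "empty container" limitation is easy to reproduce offline. In this sketch the HTML string is a hypothetical response from a JavaScript-rendered page, and a small standard-library parser (standing in for any static HTML parser) looks for text inside the product container.

```python
from html.parser import HTMLParser

# Hypothetical response body from a JavaScript-rendered page: the product
# list is injected by app.js after load, so the raw HTML holds only a shell.
RAW_HTML = """
<html><body>
  <div id="products"></div>
  <script src="/app.js"></script>
</body></html>
"""


class ContainerText(HTMLParser):
    """Collects the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # > 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


parser = ContainerText("products")
parser.feed(RAW_HTML)
print(parser.chunks)  # [] because the container exists but nothing is in it yet
```

A browser automation tool sees the populated container because it runs the script first; a static parser only ever sees this shell.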


Traditional Scraping in Practice
Let's build a real, working scraper using Books to Scrape, a sandbox site built specifically for practicing web scraping. The goal is to extract the title, price, and star rating for every book listed on the first page.
Step 1: Install Dependencies
pip install requests beautifulsoup4

Step 2: Inspect the Page
Before writing a single line of code, open the target page in your browser and inspect its HTML. Right-click any book title and choose "Inspect" to see the structure.


You'll notice each book lives inside an `<article class="product_pod">` element, and within it:

The title is in the `<h3>` tag, inside an `<a>` element (as a `title` attribute)

The price is in a `<p class="price_color">` element

The star rating is encoded in the CSS class of a `<p>` element: for example, `<p class="star-rating Three">` means three stars


This is the core detective work of traditional scraping: you study the HTML, find the patterns, and write selectors to match them.
Step 3: Write the Scraper
import requests
from bs4 import BeautifulSoup

# 1. Fetch the page
url = "https://books.toscrape.com/"
response = requests.get(url)

# Always check the request succeeded before going further
if response.status_code != 200:
    # SystemExit with a message prints it to stderr and exits with a nonzero status
    raise SystemExit(f"Failed to fetch page: {response.status_code}")

# 2. Parse the HTML
soup = BeautifulSoup(response.content, "html.parser")

# 3. Find all book containers on the page
books = soup.select("article.product_pod")

# 4. Extract data from each book
results = []

for book in books:
    # Title is stored as an attribute, not visible text
    title = book.select_one("h3 a")["title"]

    # Price is the text inside the price element
    price = book.select_one("p.price_color").get_text(strip=True)

    # Rating is encoded as a word in the CSS class: "star-rating Three"
    # We grab the second class name and map it to a number
    rating_word = book.select_one("p.star-rating")["class"][1]
    rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    rating = rating_map.get(rating_word, 0)

    results.append({
        "title": title,
        "price": price,
        "rating": rating
    })

# 5. Display results
for book in results:
    print(f"{book['title']} | {book['price']} | {book['rating']} stars")

Step 4: Run It
python scraper.py

Your output will look something like this:
A Light in the Attic | £51.77 | 3 stars
Tipping the Velvet | £53.74 | 1 stars
Soumission | £50.10 | 1 stars
Sharp Objects | £47.82 | 4 stars
Sapiens: A Brief History of Humankind | £54.23 | 5 stars
...

Twenty books, all structured and clean.
Step 5: Extend It to Multiple Pages
The site has 50 pages. Extending the scraper to crawl all of them requires following the "next" button:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/"
start_url = "https://books.toscrape.com/catalogue/page-1.html"

all_books = []
url = start_url

while url:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.content, "html.parser")

    for book in soup.select("article.product_pod"):
        title = book.select_one("h3 a")["title"]
        price = book.select_one("p.price_color").get_text(strip=True)
        rating_word = book.select_one("p.star-rating")["class"][1]
        rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        rating = rating_map.get(rating_word, 0)
        all_books.append({"title": title, "price": price, "rating": rating})

    # Check for a "next" button and follow it
    next_btn = soup.select_one("li.next a")
    url = BASE_URL + next_btn["href"] if next_btn else None

print(f"Scraped {len(all_books)} books total.")

Running this crawls all 1,000 books across all 50 pages.
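Concatenating BASE_URL with the href works here because the crawl starts inside /catalogue/, where the "next" link is a bare file name such as page-2.html. If a page ever gave the link relative to a different directory, plain concatenation would produce a wrong URL. A slightly more robust sketch resolves the link against the page it was found on, using the standard library's `urljoin` (the helper name `next_page_url` is hypothetical):

```python
from urllib.parse import urljoin


def next_page_url(current_url: str, href: str) -> str:
    """Resolve a relative 'next' link against the page it appeared on."""
    return urljoin(current_url, href)


# A link found on a catalogue page
print(next_page_url("https://books.toscrape.com/catalogue/page-1.html", "page-2.html"))
# A link written relative to the site root resolves to the same place
print(next_page_url("https://books.toscrape.com/", "catalogue/page-2.html"))
```

Both calls print https://books.toscrape.com/catalogue/page-2.html. In the loop above, this would replace `BASE_URL + next_btn["href"]` with `next_page_url(url, next_btn["href"])`.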
What Makes This Approach Fragile
This scraper works well today because books.toscrape.com is a static, stable sandbox. In production, the same approach has a well-known weakness: it's completely dependent on the HTML structure staying the same.
If the site's developer renames product_pod to book-card, or moves the price out of its <p> element into a different tag, the affected selectors break. You get no data or, worse, incorrect data with no error, and you only discover the breakage when someone notices the output looks wrong.
This is one of the problems AI scraping is designed to address.
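Whichever approach you choose, a selector-based scraper can at least be made to fail loudly. The sketch below checks the scraped rows before you use them. The function name `check_books` and the exact rules (a price like £51.77, a rating from 1 to 5) are assumptions based on the output shown earlier, not part of the original scraper.

```python
import re

# Assumed price format, based on the sample output: a pound sign and two decimals
PRICE_PATTERN = re.compile(r"^£\d+\.\d{2}$")


def check_books(books, expected_min=1):
    """Raise ValueError when scraped rows look wrong instead of failing silently."""
    if len(books) < expected_min:
        raise ValueError(f"Expected at least {expected_min} books, got {len(books)}")
    for i, book in enumerate(books):
        if not book.get("title"):
            raise ValueError(f"Row {i}: missing title")
        price = str(book.get("price") or "")
        if not PRICE_PATTERN.match(price):
            raise ValueError(f"Row {i}: unexpected price {price!r}")
        # rating_map.get(..., 0) yields 0 for an unknown class name, which is caught here
        if book.get("rating") not in (1, 2, 3, 4, 5):
            raise ValueError(f"Row {i}: unexpected rating {book.get('rating')!r}")
    return books


check_books([{"title": "A Light in the Attic", "price": "£51.77", "rating": 3}])
```

Calling `check_books(results, expected_min=20)` right after the extraction loop would turn a renamed class into an immediate error rather than an empty result.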
What is AI Web Scraping?
Traditional scraping works by following the structure of a page. It looks for specific elements, class names, or patterns in the HTML and extracts data based on those rules.
AI-powered scraping approaches the same problem differently. Instead of relying only on structure, it focuses on understanding the content itself. It looks at a page and identifies what something represents, not just where it's located.
In a traditional scraper, you might write something like:
response.css(".product-card .price::text").get()

You're telling the system exactly where to look. With AI scraping, you describe the outcome instead:
Extract the product name, price, and availability for each item on this page.

The system reads the page, identifies what appears to be a product listing, extracts the relevant fields, and returns structured data.
What's Actually Happening Under the Hood
AI scraping can feel like magic at first, but it's built on a combination of familiar components.
At the core are large language models (LLMs) trained on vast amounts of text, including web content and HTML. Over time, they learn patterns such as what a product listing looks like, how prices are usually presented, or how job listings are structured.
When given a page, the model can recognize these patterns and map them to the fields you asked for.
But the model is only one part of the system. You still need something to load and interact with the page. That is where browser automation comes in. Most AI scraping tools rely on headless browsers like Chromium or frameworks like Playwright to render pages, execute JavaScript, and handle real-world behavior such as scrolling or clicking.
On top of that, there's a layer that interprets your input. When you write a prompt describing the data you want, the system translates that into an extraction task. It decides what parts of the page are relevant and how to structure the output.
Finally, the system formats the results into clean data, typically as JSON or CSV, so you can use them directly with minimal post-processing.
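The prompt-interpretation and formatting layers can be sketched in a few lines. This is an illustration only, not how any particular tool is implemented: `build_prompt`, `extract`, and `fake_model` are hypothetical names, and `fake_model` returns a canned answer in place of a real LLM call so the example runs offline.

```python
import json


def build_prompt(instruction: str, page_text: str) -> str:
    """Combine the user's instruction with the rendered page content."""
    return (
        f"{instruction}\n"
        "Respond with a JSON array of objects and nothing else.\n\n"
        f"PAGE CONTENT:\n{page_text}"
    )


def extract(instruction: str, page_text: str, call_model) -> list:
    """Send the prompt to a model and parse its reply into structured rows."""
    raw = call_model(build_prompt(instruction, page_text))
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("Model did not return a JSON array")
    return records


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call; the reply is hard-coded for illustration
    return '[{"name": "Blue Mug", "price": "$12.00", "availability": "In stock"}]'


rows = extract(
    "Extract the product name, price, and availability for each item on this page.",
    "Blue Mug $12.00 In stock",
    fake_model,
)
print(rows[0]["name"])  # Blue Mug
```

A real system would put a headless browser in front of this to produce `page_text`, and would usually retry or repair replies that are not valid JSON.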
Note: Tools like ChatGPT can interpret content, but they're not scraping systems. They don't crawl pages, handle workflows, or run repeatable data extraction. AI scraping tools combine this intelligence with the infrastructure required to collect data reliably.
Popular Tools Behind AI Scraping
As AI scraping has grown more popular, a number of tools have emerged that make this approach accessible without requiring you to build everything from scratch.
For example:

Spidra takes a pretty direct approach to extraction. You describe the data you want, and it handles loading the page, interpreting the content, and returning structured results. It also manages things like navigation and interactions behind the scenes, which makes it useful when you want to extract data without worrying about selectors or maintaining scraping logic.

Firecrawl focuses on turning web pages into clean, structured content. Instead of extracting specific fields like price or title, it converts entire pages into formats like markdown or simplified JSON. This makes it especially useful when you want to feed web content into AI systems or work with it in a readable format without dealing with messy HTML.

Jina Reader is designed to simplify web pages into clean text. It strips away layout noise such as navigation, ads, and styling, and focuses on the actual content. This is helpful when your goal is to understand or process the information on a page rather than extract structured fields.

Bright Data AI scrapers combine AI-based extraction with a strong scraping infrastructure. They allow you to request structured data without writing selectors, while also handling challenges like blocking and scaling. This makes them more suitable for larger or more demanding scraping tasks.

Apify sits somewhere in between traditional and AI-driven scraping. It provides a full platform for building and running scrapers, and allows you to introduce AI where it makes sense, whether for extraction or post-processing. This makes it useful when you need more control over the entire pipeline.


In practice, these tools aren't trying to solve the exact same problem. Some focus on extracting structured data, others on cleaning content, and others on building full scraping workflows. The right choice depends on what you're trying to achieve, not just the tool itself.
AI Scraping in Practice
Let's run the same data collection task of extracting books from books.toscrape.com using an AI scraping tool. We'll use Spidra's API so you can see exactly what changes.
Step 1: Get an API Key
Sign up at spidra.io and create an API key from your dashboard. You'll use this key to authenticate every request.


Step 2: Understand the API Structure
Spidra's scrape endpoint accepts a JSON payload. The two most important fields are urls (a list of objects, each with the url of a page to scrape) and prompt (what to extract, written in plain English). You can optionally specify the output format. JSON works best for structured data.
POST https://api.spidra.io/scrape
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

Notice that there are no selectors and no HTML inspection here, just a URL and a description.
Step 3: Write a Single-Page Extraction
Here's the equivalent of our traditional scraper, written as an API call:
import requests
import json

API_KEY = "your_api_key_here"

payload = {
    "urls": [{"url": "https://books.toscrape.com/"}],
    "prompt": "Extract all books on this page. For each book, return the title, price, and star rating as a number from 1 to 5.",
    "output": "json"
}

response = requests.post(
    "https://api.spidra.io/scrape",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json=payload
)

data = response.json()
print(json.dumps(data, indent=2))

That's the entire scraper. No BeautifulSoup, no selector logic, and no HTML parsing.
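One practical refinement: rather than pasting the key into the script, you can read it from an environment variable so it never lands in version control. A minimal sketch follows. The variable name SPIDRA_API_KEY and the helper `auth_headers` are assumptions for illustration, not names defined by the service.

```python
import os


def auth_headers(env_var: str = "SPIDRA_API_KEY") -> dict:
    """Build the request headers from an API key stored in the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

In the script above, you would then pass `headers=auth_headers()` to `requests.post` and drop the hard-coded API_KEY constant.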
Step 4: Understand the Output
The API returns a structured JSON response. Each book is represented as an object with the fields you described:
{
  "results": [
    {
      "title": "A Light in the Attic",
      "price": "£51.77",
      "rating": 3
    },
    {
      "title": "Tipping the Velvet",
      "price": "£53.74",
      "rating": 1
    },
    {
      "title": "Soumission",
      "price": "£50.10",
      "rating": 1
    }
    ...
  ]
}

The model identified the star rating encoding (star-rating Three → 3) without being told how ratings are represented. It understood the intent of "star rating as a number from 1 to 5" and handled the mapping itself.
Step 5: Use Actions for Multi-Step Workflows
Where AI scraping starts to show its real advantages is with workflows that would require significant engineering in a traditional scraper.
Suppose you want to visit each book's detail page and extract the full description and availability status (not just what's visible on the listing page).
In a traditional scraper, this means building a follow-link loop, managing state, handling errors on each detail page, and maintaining separate selectors for the detail page's different structure. In an AI scraper like Spidra, you can mimic a real human interaction with browser actions:
{
  "urls": [{
    "url": "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    "actions": [{
      "type":            "forEach",
      "observe":         "Find all book cards in the product grid",
      "mode":            "inline",
      "captureSelector": "article.product_pod",
      "maxItems":        10,
      "itemPrompt":      "Extract the book title, price, and star rating (One/Two/Three/Four/Five). Return as JSON: {title, price, star_rating}"
    }]
  }]
}

The system navigates to each book's page, reads the new content, extracts the additional fields, and returns them as part of the same result set.
You can also specify the exact shape of the data you want back by passing a JSON schema:
{
  "urls": [{ "url": "https://jobs.example.com/senior-engineer" }],
  "prompt": "Extract the job details",
  "schema": {
    "type": "object",
    "required": ["title", "company", "remote", "employment_type"],
    "properties": {
      "title":           { "type": "string" },
      "company":         { "type": "string" },
      "location":        { "type": ["string", "null"] },
      "remote":          { "type": ["boolean", "null"] },
      "salary_min":      { "type": ["number", "null"] },
      "salary_max":      { "type": ["number", "null"] },
      "employment_type": {
        "type": ["string", "null"],
        "enum": ["full_time", "part_time", "contract", null]
      },
      "skills": {
        "type": "array",
        "items": { "type": "string" }
      }
    }
  }
}
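Even when you send a schema, it's worth checking what comes back before trusting it. The sketch below is a deliberately small checker that covers only the keywords used in the schema above (type, required, properties, enum, items). The names `find_problems` and `_matches_type` are hypothetical, and the sample records are made up for illustration. For anything beyond a quick check, a complete validator such as the jsonschema package is a better fit.

```python
def _matches_type(value, type_name):
    """Check one JSON Schema type name against a Python value."""
    if type_name == "null":
        return value is None
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "array":
        return isinstance(value, list)
    if type_name == "object":
        return isinstance(value, dict)
    return False


def find_problems(value, schema, path="$"):
    """Return a list of human-readable problems; an empty list means the value fits."""
    allowed = schema.get("type")
    if allowed is not None:
        names = allowed if isinstance(allowed, list) else [allowed]
        if not any(_matches_type(value, n) for n in names):
            return [f"{path}: expected {names}, got {type(value).__name__}"]
    problems = []
    if "enum" in schema and value not in schema["enum"]:
        problems.append(f"{path}: {value!r} is not one of {schema['enum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}: missing required field {key!r}")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems += find_problems(value[key], sub_schema, f"{path}.{key}")
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            problems += find_problems(item, schema["items"], f"{path}[{i}]")
    return problems


# A trimmed version of the schema above
schema = {
    "type": "object",
    "required": ["title", "remote"],
    "properties": {
        "title": {"type": "string"},
        "remote": {"type": ["boolean", "null"]},
        "employment_type": {
            "type": ["string", "null"],
            "enum": ["full_time", "part_time", "contract", None],
        },
        "skills": {"type": "array", "items": {"type": "string"}},
    },
}

good = {"title": "Senior Engineer", "remote": None, "employment_type": "contract", "skills": ["Python"]}
bad = {"title": "Senior Engineer", "employment_type": "freelance", "skills": ["Python", 3]}

print(find_problems(good, schema))      # []
print(len(find_problems(bad, schema)))  # 3
```

The three problems reported for `bad` are the missing remote field, the employment_type value outside the enum, and the non-string entry in skills.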

AI scraping tools typically offer more than this, including batch scraping and AI-driven crawling.
Where AI Scraping Earns Its Keep
Now suppose the site updates its frontend. The class product_pod gets renamed to book-card. The price moves into a different element.
In the traditional scraper, you get zero results and no error until you notice the data is missing. You then re-inspect the page, update the selectors, test, and redeploy.
In the AI scraper, you run the same prompt. The model isn't looking for product_pod or price_color. It's looking for content that resembles a product listing with pricing information. The layout change is invisible to the extraction logic.
This is the core operational advantage of the AI approach: structural changes to a page don't automatically break your extraction.
Traditional Scraping vs AI Scraping: When to Use Each
At this point, the difference between the two approaches is clear. The more important question is when each one actually makes sense in practice.
A simple way to think about it is this:



Scenario	Traditional Scraping	AI Scraping
Stable websites	✅ Best choice	✅ Works but may sometimes become an overkill
Frequently changing layouts	❌ Breaks often	✅ More resilient
Large-scale crawling	✅ More cost-efficient	✅ Efficient but can get expensive
Fast prototyping	❌ Slower setup	✅ Very fast
Non-technical users	❌ Requires coding	✅ More accessible
Full control & transparency	✅ High control	❌ Less transparent
Messy or inconsistent data	❌ Hard to maintain	✅ Easier to handle
Complex workflows (login, steps)	⚠️ Possible but manual	✅ Often built-in


In practice, it's not a cut-and-dry choice between the two. Traditional scraping works best when everything is predictable and stable. AI scraping becomes useful when things are messy, dynamic, or time-sensitive. Most real-world systems combine both approaches rather than relying on one alone.
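One common way to combine the two is to try the cheap selector-based extractor first and fall back to the AI extractor only when it fails or returns nothing. The sketch below shows the control flow under that assumption. `extract_with_fallback` is a hypothetical helper, and the two extractors are stubs standing in for the scrapers built earlier.

```python
def extract_with_fallback(html, selector_extract, ai_extract, min_rows=1):
    """Try selectors first; use the AI extractor when they break or come back empty."""
    try:
        rows = selector_extract(html)
    except (AttributeError, KeyError, IndexError, TypeError):
        # Typical failures when a selector no longer matches anything
        rows = []
    if len(rows) >= min_rows:
        return rows, "selectors"
    return ai_extract(html), "ai"


def broken_selectors(html):
    # Simulates select_one() returning None after a layout change
    raise AttributeError("'NoneType' object has no attribute 'get_text'")


def fake_ai(html):
    # Stand-in for a call to an AI scraping API
    return [{"title": "A Light in the Attic"}]


rows, source = extract_with_fallback("<html></html>", broken_selectors, fake_ai)
print(source)  # ai
```

Returning the source alongside the rows makes it easy to log how often the fallback fires, which is a useful signal that the selectors need updating.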
Wrapping up
Web scraping is not going away. What's changing is how we approach it.
Traditional scraping gives you control and precision, but it can be fragile and time-consuming to maintain. AI scraping makes things faster and more flexible, especially when dealing with messy or constantly changing pages, but it comes with less transparency.
In practice, most real-world workflows are starting to combine both.
We're also beginning to see AI scraping tools integrate into larger systems, especially with AI agents and MCP-style setups, where scraping becomes something that can be triggered on demand rather than built from scratch each time.
The key takeaway is simple. Traditional scraping tells the system where the data is. AI scraping tells the system what the data means.
Knowing when to use each is what actually matters.
 


 How to Turn Websites into LLM-Ready Data Using Firecrawl 
Manish Shivanandhan — Wed, 22 Oct 2025 16:02:51 +0000
 If you’ve ever tried feeding web pages into an AI model, you know the pain.
Websites come with ads, navigation bars, and messy HTML. Before your Large Language Model (LLM) can understand the content, you must clean and format it.
That’s where Firecrawl makes life easy. It’s an open-source API tool that turns any website into neat, structured data ready for LLMs in seconds.
In this tutorial, we’ll look at two ways of using Firecrawl. One is through Firecrawl’s API (a paid API with a free tier) and the other is a self-hosted version.
Table of Contents

What Is Firecrawl?

Why LLMs Need Clean Data

Setting Up Firecrawl

Scraping a Single Page

Crawling an Entire Website

Extracting Structured Data with AI

Self-hosting Firecrawl using Sevalla

Use Cases

Conclusion


What Is Firecrawl?
Firecrawl is a web crawling and scraping service that helps developers collect clean data from websites. You give it a URL, and it returns the content in formats like Markdown, HTML, JSON, or even screenshots.

Unlike basic scrapers, Firecrawl understands complex websites that load content with JavaScript. It can crawl through links, follow pages, and handle the heavy lifting like proxies and anti-bot systems automatically.
In short, it does the hard part of web data collection, so you can focus on using that data for your AI or automation projects.
Why LLMs Need Clean Data
LLMs learn and respond based on the text you give them. If that text includes clutter like HTML tags, scripts, or irrelevant sections, the AI gets confused.
Clean, well-structured data helps the model stay focused on the real content, like the article body, product details, or documentation.
Firecrawl makes this process simple. Instead of spending hours building scrapers or cleaning text, you can get ready-to-use content in a single API call.
Setting Up Firecrawl
To get started, create an account on firecrawl.dev and grab your API key. Running Firecrawl on your own machine involves setting up a server, a Redis cache, and more, so for now we’ll use the hosted API with the key from firecrawl.dev.
We can also quickly test its capabilities in the UI of the website.
Let’s use https://freecodecamp.org as the domain to see if Firecrawl can return some results.

And yes, we can see several URLs scraped by Firecrawl.

Now let’s access Firecrawl using code. The free plan lets you scrape 500 pages, so it’s all we need to understand how it works.
You can use either the Python SDK, the Node.js SDK, or direct API requests with curl.
Here’s how you install the SDKs:
Python:
pip install firecrawl-py

Node.js:
npm install @mendable/firecrawl-js

Once installed, you just need to set your API key and you’re ready to crawl.
Scraping a Single Page
Let’s say you want to extract the main content from Firecrawl’s homepage. You can do this in just a few lines.
Python Example:
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")

doc = firecrawl.scrape(
    "https://firecrawl.dev",
    formats=["markdown", "html"]
)

print(doc.markdown)

This script returns the cleaned version of the page in Markdown format, perfect for an LLM to read or analyze.
With this one command, you get the core text, free from HTML clutter.
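If the page is long, you may want to split the Markdown into sections before handing it to a model. The sketch below does this with the standard library only. `split_by_heading` is a hypothetical helper, the sample text is made up, and the approach assumes headings are written with leading # characters. It does not treat # lines inside fenced code blocks specially.

```python
import re


def split_by_heading(markdown: str):
    """Split Markdown into (heading, body) pairs, one per heading."""
    sections = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            # Close the previous section if it has a title or any content
            if title or any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if title or any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections


sample = "# Intro\nHello.\n\n## Pricing\nFree tier available."
for title, body in split_by_heading(sample):
    print(title, "->", body)
```

With the script above, you would call `split_by_heading(doc.markdown)` to get one chunk per section of the scraped page.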
Crawling an Entire Website
If you need data from multiple pages like a full documentation site, you can crawl the entire domain. Firecrawl finds all the links and scrapes them automatically.
Example API call:
curl -X POST https://api.firecrawl.dev/v2/crawl \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "limit": 10,
    "scrapeOptions": {
      "formats": ["markdown", "html"]
    }
  }'

This starts a crawl job and returns a job ID. Once done, you can download all the scraped pages in clean, LLM-ready formats.
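Because the crawl runs as a background job, your code has to check on it until it finishes. The sketch below shows a generic polling loop. It assumes the status response is a JSON object with a status field that eventually becomes "completed" or "failed" and a data field holding the pages. Treat those field names as assumptions and confirm them against Firecrawl's API reference. `wait_for_crawl` and `fetch_status` are hypothetical names.

```python
import time


def wait_for_crawl(fetch_status, poll_seconds=2.0, max_polls=30, sleep=time.sleep):
    """Poll a job-status callable until the crawl finishes, fails, or times out."""
    for _ in range(max_polls):
        job = fetch_status()  # expected to return the parsed status response as a dict
        status = job.get("status")
        if status == "completed":
            return job.get("data", [])
        if status == "failed":
            raise RuntimeError("Crawl job failed")
        sleep(poll_seconds)
    raise TimeoutError("Crawl did not finish in time")
```

Here `fetch_status` would be a small function that sends an authenticated GET request for the job ID returned by the crawl call and returns the parsed JSON. Passing it in as a callable keeps the loop easy to test without network access.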
Extracting Structured Data with AI
One of Firecrawl’s best features is AI-powered extraction. You can ask Firecrawl to read a page and return structured data, like a product’s price, description, or reviews, in JSON format.
Example:
curl -X POST https://api.firecrawl.dev/v2/extract \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": ["https://firecrawl.dev/*"],
    "prompt": "Extract the company mission and whether it is open source.",
    "schema": {
      "type": "object",
      "properties": {
        "company_mission": { "type": "string" },
        "is_open_source": { "type": "boolean" }
      }
    }
  }'

Firecrawl uses a built-in LLM to read the content and fill in the structure automatically. You can even skip the schema and just provide a natural-language prompt, like:

“Extract all the pricing details and feature names from this page.”

This is ideal for AI pipelines, RAG (Retrieval-Augmented Generation) systems, or dashboards that rely on clean, structured data.
Self-hosting Firecrawl using Sevalla
Firecrawl is open source, which means you don’t have to pay for the API if you prefer full control. You can deploy it on your own server and customise it however you like.
You can install Firecrawl on your local machine by setting up a database, cache, and other required components. But this setup will only work for local projects and won’t allow you to build or deploy applications that use Firecrawl.
To deploy Firecrawl, you can set the project up on any cloud provider, such as AWS or Heroku. I will be using Sevalla.
Sevalla is a modern, usage-based Platform-as-a-service provider. It offers application hosting, database, object storage, and static site hosting for your projects.
I am using Sevalla for hosting for two reasons:

Every platform will charge you for creating a cloud resource. Sevalla comes with a $50 credit for us to use, so we won't incur any costs for this example.

Sevalla has a template for Firecrawl, so it simplifies the manual installation and setup for each resource you will need for Firecrawl.


Log in to Sevalla and click on Templates. You can see Firecrawl as one of the templates.

Click “Deploy now” and choose a server in the pop-up, and click “Deploy”. Sevalla will start provisioning the resources we need for running our Firecrawl instance.

Once the deployment is complete, you will see three instances provisioned:

A Redis cache

A server to run Playwright

The API application


Go to the Firecrawl-API application. Under the deployments section, click on “Visit app” once the deployment is complete.

You can now use your private endpoint in your applications. My API URL is https://firecrawl-api-56t8x.sevalla.app (this is a temporary URL, so don’t use it), so I can replace api.firecrawl.dev with this URL.
curl -X POST https://firecrawl-api-56t8x.sevalla.app/v2/extract \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": ["https://firecrawl.dev/*"],
    "prompt": "Extract the company mission and whether it is open source.",
    "schema": {
      "type": "object",
      "properties": {
        "company_mission": { "type": "string" },
        "is_open_source": { "type": "boolean" }
      }
    }
  }'

If you want to run the project locally by installing applications like Redis, PostgreSQL, and Playwright, here’s a detailed guide.
Use Cases
Developers and data scientists use Firecrawl for a wide range of tasks. They often rely on it to turn documentation sites into training data for large language models, ensuring that their models can learn from accurate and well-organised sources.
Others use it to collect blog posts or news articles for sentiment analysis, helping them understand trends, opinions, or public reactions across the web.
Firecrawl is also valuable for monitoring web content changes, which is essential for research projects or compliance tracking where up-to-date information is critical.
Teams can also use it to build “chat with your website” AI assistants that can answer questions based on the latest site content.
In each of these cases, Firecrawl ensures that your model receives clean, structured, and consistent data, making it easier to build reliable and intelligent AI systems.
Conclusion
Turning messy websites into readable text used to be one of the toughest parts of building AI systems. Firecrawl changes that. With one API call, you can scrape, crawl, and extract high-quality data that your LLM can immediately understand.
If you’re building anything related to AI, RAG, or data pipelines, Firecrawl is one of those tools you’ll wish you had discovered earlier.
 


 How to Use Python to Build Your Own Web Scraper 
freeCodeCamp — Wed, 10 Jul 2024 13:11:06 +0000
 By Jess Wilk
What is Web scraping?
Web scraping is a technique used to collect large amounts of data automatically using a programming script. This makes it useful for many professionals such as data analysts, market researchers, SEO specialists, business analysts, and academic researchers.
What You'll Learn Here
Two third-party Python libraries, Requests and Beautiful Soup, make scraping websites easier. Used together, Requests retrieves the HTML content from a website and Beautiful Soup parses it so you can extract the data you need. In this article, I'll show you how to use these libraries with an example.
By the end of this guide, you will be equipped to build your own Web Scraper and have a more profound understanding of working with a large amount of data and how to apply it to make data-driven decisions.
Please note that while a web scraper is a useful tool, make sure you're compliant with all legal guidelines. This involves respecting the website's robots.txt file and adhering to the terms of service so you avoid unauthorized data extraction. 
Also, before scraping, make sure that the scraping process does not harm the website's functionality or overload its servers. Finally, respect data privacy by not scraping personal or sensitive information without proper consent.
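Python's standard library can check robots.txt rules for you. The sketch below parses a small, made-up rules file and asks whether a given URL may be fetched. `is_allowed` is a hypothetical helper, and the user agent name and example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration
rules = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(rules, "my-scraper", "https://example.com/datasets"))   # True
print(is_allowed(rules, "my-scraper", "https://example.com/private/x"))  # False
```

For a live site, `RobotFileParser` can also download the file itself: call `set_url()` with the site's robots.txt address, then `read()`, before calling `can_fetch()`.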
How Beautiful Soup and Python Requests Work Together
Let’s understand the role of each library. 
The Python Requests library is responsible for fetching HTML content from the URL you provide in the script. Once it retrieves the content, it stores the data in a response object. 
Beautiful Soup then takes over, transforming the raw HTML from the Requests response into a structured format and parsing it. You can then scrape data from the parsed HTML by specifying attributes, allowing you to automate the collection of specific data from websites or repositories.
But this duo has its limitations. The Requests library can’t handle websites with dynamic JavaScript content. So you should use it primarily for sites that serve static content from servers. If you need to scrape a dynamically loaded site, you will have to use more advanced automation tools like Selenium.
How to Build a Web Scraper with Python
Now that we understand what Beautiful Soup and Python Requests can do, let’s discuss how we can scrape data using these tools.
In the following example, we’ll be scraping data from the UC Irvine Machine Learning Repository. 

Datasets at the UC Irvine Machine Learning Repository
As you can see, it contains many datasets, and you can find further details about each dataset by going to a dedicated page for the dataset. You can access the dedicated page by clicking on the dataset name in the list above. 
Check out the image below to get an idea of the information provided for each dataset.

Iris dataset
The code we write below will go through each dataset, scrape the details, and save them to a CSV file.
Prerequisites
To try out this tutorial, you need several prerequisites set up.
I am assuming you already have a Python installation on your machine. If not, please download the latest Python from the official website.
The Requests and Beautiful Soup libraries don't come with Python. You will have to install them separately. For this, you can use the pip package manager which is included by default with Python installation since Python 3.4.
You can use pip to install the Requests and Beautiful Soup libraries using the following commands:
pip install requests
pip install beautifulsoup4

Once both are installed, you're ready to start coding.
Step 1: Import Necessary Libraries
First, import the libraries you need: Requests for making HTTP requests, BeautifulSoup for parsing HTML content, and Python's built-in csv module for saving the data.
import requests
from bs4 import BeautifulSoup
import csv

Step 2: Define the Base URL and CSV Headers
Set the base URL for the dataset listings and define the headers for the CSV file where the scraped data will be saved.
def scrape_uci_datasets():
    base_url = "https://archive.ics.uci.edu/datasets"


    headers = [
        "Dataset Name", "Donated Date", "Description",
        "Dataset Characteristics", "Subject Area", "Associated Tasks",
        "Feature Type", "Instances", "Features"
    ]


    data = []

Step 3: Create a Function to Scrape Dataset Details
Define a function scrape_dataset_details that takes the URL of an individual dataset page, retrieves the HTML content, parses it using BeautifulSoup, and extracts relevant information.

    def scrape_dataset_details(dataset_url):
        response = requests.get(dataset_url)
        soup = BeautifulSoup(response.text, 'html.parser')


        dataset_name = soup.find(
            'h1', class_='text-3xl font-semibold text-primary-content')
        dataset_name = dataset_name.text.strip() if dataset_name else "N/A"


        donated_date = soup.find('h2', class_='text-sm text-primary-content')
        donated_date = donated_date.text.strip().replace(
            'Donated on ', '') if donated_date else "N/A"


        description = soup.find('p', class_='svelte-17wf9gp')
        description = description.text.strip() if description else "N/A"


        details = soup.find_all('div', class_='col-span-4')


        dataset_characteristics = details[0].find('p').text.strip() if len(
            details) > 0 else "N/A"
        subject_area = details[1].find('p').text.strip() if len(
            details) > 1 else "N/A"
        associated_tasks = details[2].find('p').text.strip() if len(
            details) > 2 else "N/A"
        feature_type = details[3].find('p').text.strip() if len(
            details) > 3 else "N/A"
        instances = details[4].find('p').text.strip() if len(
            details) > 4 else "N/A"
        features = details[5].find('p').text.strip() if len(
            details) > 5 else "N/A"


        return [
            dataset_name, donated_date, description, dataset_characteristics,
            subject_area, associated_tasks, feature_type, instances, features
        ]

The scrape_dataset_details function retrieves the HTML content of a dataset page and parses it using BeautifulSoup. It extracts information by targeting specific HTML elements based on their tags and classes, such as dataset names, donation dates, and descriptions. 
The function uses methods like find and find_all to locate these elements and retrieve their text content, handling cases where elements might be missing by providing default values. 
This systematic approach ensures that the relevant details are accurately captured and returned in a structured format.
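The six nearly identical lookups can be folded into one helper, which also covers a case the version above does not: a details block that exists but has no p tag inside it. This is a sketch rather than a drop-in replacement. `text_or_default` and `detail_text` are hypothetical names, and `FakeTag` is a stand-in for a parsed element so the example runs without Beautiful Soup installed.

```python
def text_or_default(element, default="N/A"):
    """Return the stripped text of a parsed element, or the default when it is None."""
    return element.text.strip() if element is not None else default


def detail_text(details, index, default="N/A"):
    """Read the <p> text of the block at this index, tolerating missing blocks."""
    if index >= len(details):
        return default
    return text_or_default(details[index].find("p"), default)


class FakeTag:
    """Stand-in for a parsed element, exposing only .text and .find()."""

    def __init__(self, text="", child=None):
        self.text = text
        self._child = child

    def find(self, name):
        return self._child


details = [FakeTag(child=FakeTag("  Tabular ")), FakeTag(child=None)]
print(detail_text(details, 0))  # Tabular
print(detail_text(details, 1))  # N/A
print(detail_text(details, 5))  # N/A
```

With real Beautiful Soup results, `detail_text(details, 0)` through `detail_text(details, 5)` would replace the six conditional expressions.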
Step 4: Create a Function to Scrape Dataset Listings
Define a function scrape_datasets that takes the URL of a page listing multiple datasets, retrieves the HTML content, and finds all dataset links. For each link, it calls scrape_dataset_details to get detailed information.
    def scrape_datasets(page_url):
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, 'html.parser')


        dataset_list = soup.find_all(
            'a', class_='link-hover link text-xl font-semibold')


        if not dataset_list:
            print("No dataset links found")
            return


        for dataset in dataset_list:
            dataset_link = "https://archive.ics.uci.edu" + dataset['href']
            print(f"Scraping details for {dataset.text.strip()}...")
            dataset_details = scrape_dataset_details(dataset_link)
            data.append(dataset_details)

Step 5: Loop Through Pages Using Pagination Parameters
Implement a loop to navigate through the pages using pagination parameters. The loop continues until no new data is added, indicating that all pages have been scraped.
    skip = 0
    take = 10
    while True:
        page_url = f"https://archive.ics.uci.edu/datasets?skip={skip}&take={take}&sort=desc&orderBy=NumHits&search="
        print(f"Scraping page: {page_url}")
        initial_data_count = len(data)
        scrape_datasets(page_url)
        if len(data) == initial_data_count:
            break
        skip += take
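Since this loop requests every listing page and every dataset page back to back, it's considerate to pause between requests and to cap the number of pages while testing. The sketch below shows one way to do that. `listing_urls` and `crawl` are hypothetical helpers, the one-second delay is an arbitrary choice, and `fetch` is whatever function you use to download a page.

```python
import time
from urllib.parse import urlencode

LISTING_URL = "https://archive.ics.uci.edu/datasets"


def listing_urls(take=10, max_pages=5):
    """Yield listing-page URLs using the same skip/take parameters as the loop above."""
    for page in range(max_pages):
        query = urlencode({
            "skip": page * take,
            "take": take,
            "sort": "desc",
            "orderBy": "NumHits",
            "search": "",
        })
        yield f"{LISTING_URL}?{query}"


def crawl(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to avoid hammering the server."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        sleep(delay_seconds)
    return pages


for url in listing_urls(max_pages=2):
    print(url)
```

The cap is a safety net for experiments. The stop-when-no-new-data check from the loop above is still what ends a full crawl.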

Step 6: Save the Scraped Data to a CSV File
After scraping all the data, save it to a CSV file.
    with open('uci_datasets.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(headers)
        writer.writerows(data)


    print("Scraping complete. Data saved to 'uci_datasets.csv'.")

Step 7: Run the Scraping Function
Finally, call the scrape_uci_datasets function to start the scraping process.
scrape_uci_datasets()

Full Code
Here is the complete code for the web scraper:
import requests
from bs4 import BeautifulSoup
import csv


def scrape_uci_datasets():
    base_url = "https://archive.ics.uci.edu/datasets"


    headers = [
        "Dataset Name", "Donated Date", "Description",
        "Dataset Characteristics", "Subject Area", "Associated Tasks",
        "Feature Type", "Instances", "Features"
    ]


    # List to store the scraped data
    data = []


    def scrape_dataset_details(dataset_url):
        response = requests.get(dataset_url)
        soup = BeautifulSoup(response.text, 'html.parser')


        dataset_name = soup.find(
            'h1', class_='text-3xl font-semibold text-primary-content')
        dataset_name = dataset_name.text.strip() if dataset_name else "N/A"


        donated_date = soup.find('h2', class_='text-sm text-primary-content')
        donated_date = donated_date.text.strip().replace(
            'Donated on ', '') if donated_date else "N/A"


        description = soup.find('p', class_='svelte-17wf9gp')
        description = description.text.strip() if description else "N/A"


        details = soup.find_all('div', class_='col-span-4')


        dataset_characteristics = details[0].find('p').text.strip() if len(
            details) > 0 else "N/A"
        subject_area = details[1].find('p').text.strip() if len(
            details) > 1 else "N/A"
        associated_tasks = details[2].find('p').text.strip() if len(
            details) > 2 else "N/A"
        feature_type = details[3].find('p').text.strip() if len(
            details) > 3 else "N/A"
        instances = details[4].find('p').text.strip() if len(
            details) > 4 else "N/A"
        features = details[5].find('p').text.strip() if len(
            details) > 5 else "N/A"


        return [
            dataset_name, donated_date, description, dataset_characteristics,
            subject_area, associated_tasks, feature_type, instances, features
        ]


    def scrape_datasets(page_url):
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, 'html.parser')


        dataset_list = soup.find_all(
            'a', class_='link-hover link text-xl font-semibold')


        if not dataset_list:
            print("No dataset links found")
            return


        for dataset in dataset_list:
            dataset_link = "https://archive.ics.uci.edu" + dataset['href']
            print(f"Scraping details for {dataset.text.strip()}...")
            dataset_details = scrape_dataset_details(dataset_link)
            data.append(dataset_details)


    # Loop through the pages using the pagination parameters
    skip = 0
    take = 10
    while True:
        page_url = f"https://archive.ics.uci.edu/datasets?skip={skip}&take={take}&sort=desc&orderBy=NumHits&search="
        print(f"Scraping page: {page_url}")
        initial_data_count = len(data)
        scrape_datasets(page_url)
        if len(data) == initial_data_count:
            break
        skip += take


    with open('uci_datasets.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(headers)
        writer.writerows(data)


    print("Scraping complete. Data saved to 'uci_datasets.csv'.")


scrape_uci_datasets()

Once you run the script, it will run for a while until the terminal says “No dataset links found”, followed by “Scraping complete. Data saved to 'uci_datasets.csv'”, indicating that the scraped data has been saved in a CSV file.

To view the scraped data, open uci_datasets.csv. You should see the data organized by Dataset Name, Donated Date, Description, Dataset Characteristics, Subject Area, and so on.

Data organized by Dataset Name, Donated Date, Description, Characteristics, Subject Area, and so on.
You can have a better view of the data if you open the file via Excel.

Data organized in Excel file
By following the logic mentioned in this article, you can scrape many sites. All you need to do is start from the base URL, figure out how to navigate through the list, and go to the dedicated page for each list item. Then, identify suitable page elements like IDs and classes where you can isolate and extract the data you want. 
You also need to understand the logic behind pagination. Most often, pagination makes slight changes to the URL, which you can use to loop from one page to another. 
Finally, you can write the data to a CSV file, which is suitable for storing and as input for visualization.
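As a minimal sketch of the pagination idea, assuming a site that pages with skip and take query parameters like the UCI listing above, the page URLs can be generated up front. The helper name page_urls and the max_pages cap are illustrative and not part of the original script, which instead stops when a page returns no new rows.

```python
from urllib.parse import urlencode


def page_urls(base_url, take=10, max_pages=3):
    """Yield listing URLs for the first `max_pages` pages.

    Hypothetical helper: `skip` is the offset of the first item on a page
    and `take` is the page size, mirroring the query string used above.
    """
    for page in range(max_pages):
        query = urlencode({"skip": page * take, "take": take})
        yield f"{base_url}?{query}"


for url in page_urls("https://archive.ics.uci.edu/datasets"):
    print(url)
```

A fixed cap like this is handy while testing, because it keeps a first run short before you let the scraper walk the whole list.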
Conclusion
Using Python along with Requests and Beautiful Soup allows you to create fully functional web scrapers to extract data from websites. While this functionality can be highly advantageous for data-driven decision-making, it is important to keep ethical and legal considerations in mind.
Once you become familiar with the methods used in this script, you can explore techniques like proxy management and data persistence. You can also familiarize yourself with other libraries like Scrapy, Selenium, and Puppeteer to fulfill your data collection needs. 
Thank you for reading! I'm Jess, and I'm an expert at Hyperskill. You can check out my Python developer course on the platform.
 


 How to Scrape Amazon Product Reviews Behind a Login 
freeCodeCamp — Mon, 30 Oct 2023 16:46:40 +0000
 By Satyam Tripathi
Amazon is the most popular e-commerce website for web scrapers, with billions of product pages being scraped every month. 
It is also home to a vast database of product reviews, which can be very useful for market research and competitor monitoring. 
You can extract relevant data from the Amazon website and save it in a spreadsheet or JSON format. And you can even automate the process to update the data regularly.
Scraping Amazon product reviews is not always straightforward, especially when a login is required. In this guide, you'll learn how to scrape Amazon product reviews behind a login. You’ll learn the process of logging in, parsing review data, and exporting reviews to CSV.
Important Disclaimer: This tutorial is for educational purposes only. Scraping data from behind logins on websites may violate their terms and conditions (T&Cs).  It's crucial to always check the T&Cs of any website before scraping data.
Without further ado, let's get started.
Prerequisites and Project Setup
We’ll use the Node.js Puppeteer library to scrape Amazon reviews. Make sure Node.js is installed on your system. If it is not, go to the official Node.js website and install it. 
After Node.js is installed, install Puppeteer. Puppeteer is a Node.js library that provides a high-level, user-friendly API for automating tasks and interacting with dynamic web pages. 
Now, let's install and configure Puppeteer.
Open a terminal and create a new folder with any name. (In my case, it is amazon_reviews.)
mkdir amazon_reviews

Change your current directory to the folder created above.
cd amazon_reviews

Cool, you're now in the correct directory. Execute the following command to initialize the package.json file:
npm init -y

Finally, install Puppeteer using the following command:
npm install puppeteer

This is what the process looks like:

Now, open the folder in any code editor, and create a new JavaScript file (index.js). Make sure that the hierarchy looks like this:

Hierarchy showing node_modules, index.js, package-lock.json, and package.json
All set up successfully. We’re now ready to code the scraper.
Note: Ensure that you have an account on Amazon so you can progress through the rest of this tutorial.
Step 1: Get Access to the Public Page
You're going to scrape the reviews of the product shown below. You’ll extract the author's name, review title, and date.
Here's the product URL: https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/

The product we're using in the example - headphones
First, you’ll log in to Amazon, and then redirect to the product URL to scrape the reviews.
Step 2: Scrape Behind the Login
Amazon's multi-stage login process requires users to enter their username or email, click a Continue button, enter their password, and then submit it. The username and password fields are on separate pages.
To enter the email ID, use the selector input[name=email].

HTML of the sign-in field
Now, click on the Continue button using the selector input[id=continue].

HTML of the continue button
Now you should be on the next page. To enter the password, use the selector input[name=password].

HTML of the password field
Finally, click on the Sign In button using the selector input[id=signInSubmit].

HTML of the sign-in button
Here’s the code for the login process:
const selectors = {
  emailid: 'input[name=email]',
  password: 'input[name=password]',
  continue: 'input[id=continue]',
  signin: 'input[id=signInSubmit]',
};

// signinURL must hold the address of the Amazon sign-in page,
// and page is a Puppeteer Page object.
await page.goto(signinURL);
await page.waitForSelector(selectors.emailid);
// Type with a 100 ms delay between keystrokes.
await page.type(selectors.emailid, "satyam@gmail.com", { delay: 100 });
await page.click(selectors.continue);
await page.waitForSelector(selectors.password);
await page.type(selectors.password, "mypassword", { delay: 100 });
// Start waiting for the navigation before clicking so it is not missed.
await Promise.all([
  page.waitForNavigation(),
  page.click(selectors.signin),
]);

We're following the same steps as discussed above. First, go to the sign-in URL, enter the email ID, and click on the Continue button. Then enter the password, click on the Sign In button, and wait for a moment for the sign-in process to complete.
After the sign-in process is completed, you’ll be redirected to the product page to scrape the reviews.

Product page
Step 3: Parse the Review Data
You've successfully logged in and are now on the product page that you want to scrape. Let's now parse the review data.
On the page, you'll find various reviews. These reviews are contained within a parent div with the ID cm-cr-dp-review-list, which holds all the reviews on the current page. If you want to access more reviews, you'll need to navigate to the next page using the pagination process.
This parent div has multiple child divs, and each child div holds one review. To extract the reviews, you can use the selector #cm-cr-dp-review-list div.review.
const selectors = {
  allReviews: '#cm-cr-dp-review-list div.review',
  authorName: 'div[data-hook="genome-widget"] span.a-profile-name',
  reviewTitle: '[data-hook=review-title]>span:not([class])',
  reviewDate: 'span[data-hook=review-date]',
};

This selector first finds the element with the ID cm-cr-dp-review-list, then matches every div inside it that has the class review. 

Review data with Author name, Review Title, Description, etc.
The following code snippet shows that you should first go to the product URL, wait for the selector to load, and then scrape all the reviews and store them in the reviewElements variable.
await page.goto(productURL);
await page.waitForSelector(selectors.allReviews);
const reviewElements = await page.$$(selectors.allReviews);

Now, let's extract the author's name, review title, and date.

Targeting Author name, Review Title, and Date
To parse the author name, you can use the selector div[data-hook="genome-widget"] span.a-profile-name. This selector tells us to first search for the div element with the data-hook attribute set to genome-widget, because the names are inside this div element. Then, search for the span element with the class name a-profile-name. This is the element that contains the author's name.
const author = await reviewElement.$eval(selectors.authorName, (element) => element.textContent);

To parse the review title, you can use the CSS selector [data-hook="review-title"] > span:not([class]). This selector tells us to search for the span element that is a direct child of the [data-hook="review-title"] element and that does not have a class attribute.
const title = await reviewElement.$eval(selectors.reviewTitle, (element) => element.textContent);

To parse the date, you can use the CSS selector span[data-hook="review-date"]. This selector tells us to search for the span element that has the data-hook attribute set to review-date. This is the element that contains the review date.
const rawReviewDate = await reviewElement.$eval(selectors.reviewDate, (element) => element.textContent);

Note that you’ll get the entire text, including the location, instead of just the date. Therefore, you must use a regular expression to extract the date from the text. 
After that, combine all of the data into the reviewData object and then push it to the final list, reviewsData.
const datePattern = /(\w+\s\d{1,2},\s\d{4})/;
const match = rawReviewDate.match(datePattern);
const reviewDate = match ? match[0].replace(',', '') : "Date not found";

const reviewData = {
  author,
  title,
  reviewDate,
};

reviewsData.push(reviewData);

The above process will run until it has parsed all of the reviews on the current page. Here’s the code snippet to parse the data:
// Collects one object per review.
const reviewsData = [];
const datePattern = /(\w+\s\d{1,2},\s\d{4})/;

for (const reviewElement of reviewElements) {
  const author = await reviewElement.$eval(selectors.authorName, (element) => element.textContent);
  const title = await reviewElement.$eval(selectors.reviewTitle, (element) => element.textContent);
  const rawReviewDate = await reviewElement.$eval(selectors.reviewDate, (element) => element.textContent);

  // Keep only the date part, for example "October 30 2023".
  const match = rawReviewDate.match(datePattern);
  const reviewDate = match ? match[0].replace(',', '') : "Date not found";

  const reviewData = {
    author,
    title,
    reviewDate,
  };

  reviewsData.push(reviewData);
}

Great! You’ve successfully parsed the relevant data, which is now in JSON format, as shown below:

Scraped the data in JSON format
Step 4: Export Reviews to a CSV
You've parsed the reviews into JSON format, which is readable enough for a quick inspection. Converting the data to CSV makes it easier to open in a spreadsheet and to reuse for other purposes. 
There are many ways to convert JSON data to CSV, but we'll use a simple and effective approach. Here is a simple code snippet to convert JSON to CSV:
const fs = require("fs");

let csvContent = "Author,Title,Date\n";
for (const review of reviewsData) {
  const { author, title, reviewDate } = review;
  csvContent += `${author},"${title}",${reviewDate}\n`;
}

const csvFileName = "amazon_reviews.csv";
// writeFileSync is synchronous, so no await is needed.
fs.writeFileSync(csvFileName, csvContent, "utf8");

Here’s the output of the CSV file.

Converted JSON data into CSV format
And there you have it!
You can find the full code on GitHub here.
Conclusion
In this guide, you learned how to scrape Amazon product reviews behind a login using Puppeteer. You learned how to log in, parse relevant data, and save it to a CSV file. 
To practice more, you can extract all the reviews of all the pages using pagination.
 


 Web Scraping with Google Sheets – How to Scrape Web Pages with Built-in Functions 
Eamonn Cottrell — Thu, 07 Sep 2023 21:14:07 +0000
 You read that right – you can practice web scraping without leaving your happy place: Google Sheets.
Google Sheets has five built-in functions that help you import data from other sheets and other web pages. We'll walk through all of them in order from easiest (most limited) to hardest (most powerful).
Here they are, and you can click each function to skip down to its dedicated section. I've made a video as well that walks through everything:

        
Section Shortcuts

How to use the IMPORTRANGE function
How to use the IMPORTDATA function
How to use the IMPORTFEED function
How to use the IMPORTHTML function
How to use the IMPORTXML function

Here's the Google Sheet we'll be using to demo each function.
If you'd like to edit it, make a copy by selecting File - Make a copy when you open it.

screenshot of Google Sheet

How to use the IMPORTRANGE function
This is the only function that imports a range from another sheet rather than data from another web page. So, if you've got another Google Sheet, you can link the two sheets together and import the data you need from one sheet into the other sheet.
For instance, here's a sheet with a bunch of random Samsung Galaxy data in it.

You can see that we have a few hundred rows of data about phones. If we want to import this data into another spreadsheet, we can use IMPORTRANGE(). This is the simplest to use of the five functions we'll look at. All it needs is a URL for a Google Sheet and the range we want to import.
Check out the tab for IMPORTRANGE in the Google Sheet here, and you'll see that in cell A5, we've got the function =IMPORTRANGE(B4,"data!a1:K"). This is pulling in the range A1:K from the data tab of our second spreadsheet whose URL is in cell B4.

screenshot of IMPORTRANGE function
Once your data is pulled into your spreadsheet, you can do one of two things. 

Leave it linked through the IMPORTRANGE function. This way, if your data source is going to be updated, you'll pull in the updated data.
Copy and CTRL+SHIFT+V to paste values only. This way, you have the raw data in your new spreadsheet and you won't have to be dependent on something changing with the URL down the road.


How to use the IMPORTDATA function
This is pretty straightforward. It'll import .csv or .tsv data from anywhere on the internet. These stand for Comma Separated Values and Tab Separated Values. 
.csv is the most commonly used file type for financial data that needs to be imported into spreadsheets and other programs. 
Like IMPORTRANGE, we only need a couple pieces of information for IMPORTDATA: the URL where the file lives, and the delimiter. There's also an optional variable for locale, but I found that it was unnecessary.
In fact, Google Sheets is pretty smart – you can leave off the delimiter too, and it will usually decipher what type of data (.csv or .tsv) lives at the URL.
You can see that I've found a New York government data website where there lives some winning lottery number data. I've put the URL for a .csv file in A5, and then used the function =IMPORTDATA(A5,",") to pull in the data from the .csv file.

Screenshot of IMPORTDATA function
You could alternatively download the .csv file and then select File - Import to bring in this data. But in the event that you do not have download permissions or simply want to get it straight from a site, IMPORTDATA works great.
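Outside of Sheets, the delimiter guessing described above can be sketched in Python with the standard library's csv.Sniffer. This is only an illustration of the idea, not how Google Sheets implements it; the helper name read_delimited and the sample rows are made up.

```python
import csv
import io


def read_delimited(text):
    """Parse CSV or TSV text, guessing the delimiter from the content."""
    # Restrict the guess to comma and tab, the two cases discussed above.
    dialect = csv.Sniffer().sniff(text, delimiters=",\t")
    return list(csv.reader(io.StringIO(text), dialect))


# Made-up sample rows in both formats.
csv_text = "draw_date,numbers\n2023-09-01,1 2 3\n2023-09-02,4 5 6\n"
tsv_text = "draw_date\tnumbers\n2023-09-01\t1 2 3\n2023-09-02\t4 5 6\n"

print(read_delimited(csv_text)[1])
print(read_delimited(tsv_text)[1])
```

Both calls return the same rows, which is the point: the caller never states whether the text is comma or tab separated.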

How to use the IMPORTFEED function
This imports RSS feed data. If you're familiar with podcasting, you may recognize the term. Every podcast has an RSS feed which is a structured file full of XML data. 
Using the URL for the RSS feed, IMPORTFEED will pull in data about a podcast, news article, or blog from its RSS information.
This is the first function that begins to have a few more options at its disposal, too.
All that's required is the URL of a feed, and it'll bring in data from that feed. However, we can specify a few other parameters if we like. The options include:

[query]: this specifies which pieces of data to pull from the feed. We can select from options like "feed <type>" where <type> can be title, description, author or URL. Same deal with "items <type>" where <type> can be title, summary, URL or created.
[headers]: this will either bring in headers (TRUE) or not (FALSE)
[num_items]: this will specify how many items to return when using Query. (The docs state that if this isn't specified, all items currently published are returned, but I did not find this to be the case. I had to specify a larger number to get back more than a dozen or so).

You can see from the screenshots below that I am querying one of my feeds to pull in the episode titles and URLs. 
First, to get all the titles, I used IMPORTFEED(A3, "items title", TRUE, 50):

Screenshot of IMPORTFEED
Then, similarly for the URLs, I used IMPORTFEED(A3, "items url", TRUE, 50):

Screenshot of IMPORTFEED #2
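To make the "items title" and "items url" queries more concrete, here is a rough Python sketch of the same extraction over a tiny, made-up RSS document. The function name feed_items and the feed contents are assumptions for illustration; in RSS the item URL lives in a link element.

```python
import xml.etree.ElementTree as ET

# A tiny made-up RSS document standing in for a real feed.
RSS = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item><title>Episode 1</title><link>https://example.com/ep1</link></item>
    <item><title>Episode 2</title><link>https://example.com/ep2</link></item>
    <item><title>Episode 3</title><link>https://example.com/ep3</link></item>
  </channel>
</rss>"""


def feed_items(rss_text, field, num_items=None):
    """Return `field` (for example "title" or "link") from each <item>."""
    channel = ET.fromstring(rss_text).find("channel")
    values = [item.findtext(field, default="") for item in channel.findall("item")]
    # Slicing with None keeps every item, like leaving num_items unset.
    return values[:num_items]


print(feed_items(RSS, "title", num_items=2))
print(feed_items(RSS, "link"))
```

The num_items argument plays the same role as the fourth IMPORTFEED parameter: it caps how many items come back.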

How to use the IMPORTHTML function
Now we're getting into scraping data straight off of a web site. This will take a URL and then a query parameter where we specify to look for either a "table" or a "list".
This is followed by an index value representing which table or list to look for if there are multiple on the page. The sample sheet passes 0 to get the first one, although Google's documentation describes the index as starting at 1, so try 1 if 0 does not return the table you expect.
IMPORTHTML looks through the HTML code on a website for <table> and list (<ol> and <ul>) HTML elements.

<table>
    <thead>
        <tr>
            <th>table header 1</th>
            <th>table header 2</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>table data row 1 cell1</td>
            <td>table data row 1 cell2</td>
        </tr>
        <tr>
            <td>table data row 2 cell1</td>
            <td>table data row 2 cell2</td>
        </tr>
    </tbody>
</table>


<ol>
    <li>ordered item 1</li>
    <li>ordered item 2</li>
    <li>ordered item 3</li>
</ol>

<ul>
    <li>unordered item 1</li>
    <li>unordered item 2</li>
    <li>unordered item 3</li>
</ul>
In the sample sheet, I've got the URL for some stats about the Barkley Marathons in cell B3 and am then referencing that in A4's function: =IMPORTHTML(B3,"table",0).

Screenshot of IMPORTHTML
FYI, freeCodeCamp created ScrapePark as a place to practice web scraping, so you can use it for IMPORTHTML and IMPORTXML coming up next👇.
How to use the IMPORTXML function
We saved the best for last. This will look through websites and scrape darn near anything we want it to. It's complicated, though, because instead of importing all the table or list data like with IMPORTHTML, we write our queries using what's called XPath. 
XPath is an expression language used to query XML documents. We can write XPath expressions to have IMPORTXML scrape all kinds of things from an HTML page.
There are many resources to find the proper XPath expressions. Here's one that I used for this project.

screenshot of XPath cheat sheet
In the sheet for IMPORTXML, I have several examples that I encourage you to click through and check out.
For example, using the function =IMPORTXML(A11,"//*[@class='post-card-title']") allows us to bring in all the titles of my articles because from inspecting the HTML on my author page here, I found that they all have the class post-card-title.

screenshot of inspecting a web page with dev tools
In the same way, we can use the function =IMPORTXML(A11,"//*[@class='post-card-title']//a/@href") to grab the URL slug of each of those articles.

screenshot of IMPORTXML
You'll notice that it does not bring in the full URL, so as a bonus, we can simply prepend the domain to each of these. Here's the function for the first row which we can drag down to get all those proper URLs: ="https://www.freecodecamp.org"&B13

screenshot of prepending domain to slug
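For readers who want to see the same XPath idea outside of Sheets, here is a small Python sketch using the standard library's ElementTree, which supports only a subset of XPath. The markup is made up and deliberately well-formed; real pages usually need a proper HTML parser, so treat this as an illustration of the class predicate and the domain-prepending step only.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for an author page.
PAGE = """<main>
  <article><h2 class="post-card-title"><a href="/news/first-post/">First post</a></h2></article>
  <article><h2 class="post-card-title"><a href="/news/second-post/">Second post</a></h2></article>
</main>"""

root = ET.fromstring(PAGE)

# Same class predicate as the IMPORTXML query; ElementTree paths start with ".".
cards = root.findall(".//*[@class='post-card-title']")
titles = ["".join(card.itertext()).strip() for card in cards]

# ElementTree cannot select attributes with /@href, so read them with .get().
links = root.findall(".//*[@class='post-card-title']//a")
slugs = [link.get("href") for link in links]

# Prepend the domain, like the ="https://www.freecodecamp.org"&B13 formula.
urls = ["https://www.freecodecamp.org" + slug for slug in slugs]

print(titles)
print(urls)
```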
Follow Me
I hope this was helpful for you! I learned a lot myself, and enjoyed putting the video together. 
You can find me on YouTube: https://www.youtube.com/@eamonncottrell
And, I've got a newsletter here: https://got-sheet.beehiiv.com/

 Introducing ScrapePark.org – Practice Web Scraping Without Hurting Anyone 
Quincy Larson — Mon, 21 Aug 2023 19:50:15 +0000
 When I grew up in the 1990s, we skateboarded everywhere. If a parking lot had a ledge or a handrail, you'd better believe we waxed it and grinded our boards across it.
Fast forward to the 2020s – many cities have built public skate parks. These are safe, legal places where we can skateboard without getting hassled by business owners or the cops.
As a dad in his 40s, I'm not scraping concrete with a skateboard much these days. Instead, I'm scraping websites to gather data I can analyze or feed into a machine learning algorithm.
But scraping the web can sometimes overload the websites you're scraping. Even though it may not violate a website's Terms of Service, you should be very gentle when doing it.
If you scrape a website that's not prepared, or if your scraping code is inefficient, you might bring the entire website down.
There are few things more embarrassing than accidentally DDoS'ing a friendly website that's just trying to serve visitors on the open web.
That's why – in the spirit of public skate parks – freeCodeCamp created ScrapePark.org. Anyone can go there and practice web scraping techniques without worrying about hurting anyone.

ScrapePark.org's landing page where you can practice web scraping techniques
ScrapePark.org is a simple E-commerce-style page that we've built specifically to stand up to heavy traffic loads.
You can practice scraping:

tables
iframes
dropdown menus
links
lists
images
buttons
forms
image carousels
menu items
navigation bar items
and more

The entire project is open source. If you want to add some additional elements or pages to it, please be our guest.
We'd love for this to become the main place that people practice their scraping, so that we can spare those "mom & pop" websites from getting overloaded.
Hone your web scraping skills on ScrapePark.org so you can then use them more responsibly on the open web.
This is just one of many initiatives the freeCodeCamp community has been working on this year. We have a lot more cool projects in the works.
A huge thanks to freeCodeCamp teachers Estefania Cassingena Navone and Gustavo Juantorena for helping develop ScrapePark.org. They just published a Spanish-language course focused on Web Scraping. If you speak Spanish, check out the course. [2 hour watch]:

        
If you want an English-language scraping course, freeCodeCamp has you covered. Try searching "scraping course" on Google or YouTube and look for the good ones from freeCodeCamp. 😃
Happy scraping.
 

 How to Scrape Multiple Web Pages Using Python 
freeCodeCamp — Tue, 14 Feb 2023 19:48:00 +0000
 By Shittu Olumide
Data is all around us. Every website you visit includes data in a readable format that you can utilize for a project. 
And although you can easily copy and paste the data, the best approach for large amounts of data is to perform web scraping.
Learning web scraping can be tricky at first, but with a good web scraping library, things will become much easier. 
Web scraping can be a useful tool for gathering data and information, but it is important to ensure that you do it in a safe and legal manner. 
Here are some tips for performing web scraping properly:

Seek permission before you scrape a site.
Read and understand the website's terms of service and robots.txt file.
Limit the frequency of your scraping.
Use web scraping tools that respect website owners' terms of service.
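Two of these tips, reading robots.txt and limiting request frequency, can be sketched with Python's standard library. The robots.txt content, the user agent name my-scraper, and the helper names below are all made up for illustration; a real scraper would download the site's own robots.txt first.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would fetch the site's own file.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed_urls(urls, user_agent="my-scraper"):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def polite_delay(user_agent="my-scraper", default=1.0):
    """Seconds to wait between requests: Crawl-delay if set, else a default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


candidates = [
    "https://example.com/movies",
    "https://example.com/private/admin",
]
print(allowed_urls(candidates))
print(polite_delay())
```

In a scraping loop you would call time.sleep(polite_delay()) between requests so the site is never hit faster than it asks.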

Now that you understand the proper way to approach scraping, let's dive in. In this step-by-step tutorial, we will walk through how to scrape several pages of a website using Python's most user-friendly web scraping module, Beautiful Soup.
This tutorial will be divided into two portions: we will scrape a single page in the first phase. Then in the second section, we'll scrape several pages based on the code used in the first section.
Requirements
Python 3: you'll need Python 3 for this tutorial, because the libraries we'll use are Python packages. To download and install Python, check out the official website.
Beautiful Soup: Beautiful Soup is a Python package for structured data parsing. For parsed pages, it generates a parse tree that you can use to extract data from HTML. It lets you interact with HTML in the same way you can interact with a web page using developer tools. 
To begin using it, launch your terminal and install Beautiful Soup:
pip install beautifulsoup4

Requests library: Requests is the de facto standard Python package for making HTTP requests (it is a third-party package, not part of the standard library). We'll use it in conjunction with Beautiful Soup to obtain the HTML for a website.
pip install requests

Install a parser: To extract data from HTML text, we need a parser. We'll utilize the lxml parser here. To install this parser, execute the following command:
pip install lxml

Note: You don't have to be a Python professional to follow this tutorial.
How to Scrape a Single Web Page
As I explained earlier, we will start by understanding how to scrape a single web page. Then we'll move on to scraping multiple web pages. 
Let's build our first scraper.
Import the libraries
First, let's import the libraries we'll need:
import requests
from bs4 import BeautifulSoup

Get the website HTML
We want to scrape a website with hundreds of pages of movie transcripts. We'll begin by scraping a single page, and then demonstrate how to scrape multiple pages.  
First, we'll define the connection. In this example, we'll use the Titanic movie transcript, but you can select any movie you wish. 
Then we make a request to the website and receive a response, which we store in the result variable. Following that, we'll use the .text attribute to retrieve the website's content. 
Finally, we'll use the lxml parser to get the soup, which is the object containing all of the data in the nested structure that we'll reuse later.
website = 'https://subslikescript.com/movie/Titanic-120338'

result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, 'lxml')

print(soup.prettify())

Once we have the soup object, we can simply get readable HTML by using .prettify(). Although we may use the HTML printed in a text editor to find elements, it is far easier to go straight to the HTML code of the element we seek. We'll do this in the following phase. 
Examine the webpage and HTML code
Before we start writing code, we must first assess the website we want to scrape and the HTML code we got to identify the best strategy to scrape the website. A sample transcript is available below. The things to be scraped are the movie title and transcript.

Image showing the title and transcript of the titanic movie.
To get the HTML code for a given element, perform the following steps:

Navigate to the Titanic transcript's website.
Right-click on either the movie title or the transcript. You'll see a list. Select "Inspect" to view the page's source code.


Image showing page source code
How to find an element with Beautiful Soup
It's easy to find an element in Beautiful Soup. Simply apply the .find() method to the previously prepared soup.
As an example, find the box containing the movie title, description, and transcript. It's within an article tag and has the class main-article on it. We can use the following code to find that box:
box = soup.find('article', class_='main-article')

The movie title is enclosed in an h1 tag and lacks a class name. After we find it, we use the .get_text() function to retrieve the text within the node:
title = box.find('h1').get_text()

The transcript is included within a div tag and has the class full-script. In this scenario, we'll change the default arguments within the .get_text() function to get the text. 
We begin by setting strip=True to remove leading and trailing whitespace from each piece of text. Then we set separator=' ' so that the pieces are joined with a single space, which keeps words on either side of a line break (\n) from running together.
transcript = box.find('div', class_='full-script')
transcript = transcript.get_text(strip=True, separator=' ')

So far, we've scraped the data successfully. Print the title and transcript variables to ensure that everything is operating properly.
How to export data into a .txt file
You can store data in CSV, JSON, and other formats. In this example, we'll save the extracted data in a .txt file. To accomplish this, we will use the with keyword, as shown in the code below:
with open(f'{title}.txt', 'w') as file:
    file.write(transcript)

Remember to use the f-string to set the file name as the movie title. After running the code, we should have a .txt file in our working directory.
We're ready to scrape transcripts from multiple pages now that we've successfully scraped data from one web page!
How to Scrape Multiple Web Pages
On the transcript page, scroll down and click the link to all movie scripts. You can find it at the bottom of the web page.  

All transcripts page
The screenshot shows all of the movie transcripts. The website has 1,757 pages, with approximately 30 movie transcripts on each page. 
In this section, we will scrape multiple links by obtaining the href attribute of each link. First, we must change the website variable so that it points at the listing page. Our new website variable will be as follows:
root = 'https://subslikescript.com'
website = f'{root}/movies'

The main reason why a root variable is defined in the code is to help scrape multiple web pages later.
How to get the href attribute
Let's start with the href attribute of the 30 movies on one page. Examine any movie title within the "List of Movie Transcripts" box. 
Following that, we should have the HTML code. An a tag should be highlighted in blue. Each a tag belongs to a movie title.

As we can see, the links within the href do not include the root domain subslikescript.com. This is why we created a root variable before concatenating it.
Let's look for all of the a elements on the page.
How to find multiple elements
In Beautiful Soup, we use the .find_all() method to locate multiple elements. To extract the link that corresponds to each movie transcript, we must include the parameter href=True.
box.find_all('a', href=True)

To get the links from the href, add ['href'] to the expression above. However, because the .find_all() method returns a list, we must loop through it and get the hrefs one by one within the loop.
for link in box.find_all('a', href=True):
    link['href']

We can use list comprehension to save the links, as shown below:
links = [link['href'] for link in box.find_all('a', href=True)]
print(links)

The links we want to scrape will be visible if you print the links list. In the following step, we'll scrape each page.
How to loop through each link
To scrape the transcript of each link, we'll repeat the steps we used for the first transcript. This time, we'll put those steps inside the for loop below.
for link in links:
    result = requests.get(f'{root}/{link}')
    content = result.text
    soup = BeautifulSoup(content, 'lxml')

As you may recall, the links we previously saved did not include the root subslikescript.com, so we must concatenate it with the expression f'{root}/{link}'.
The rest of the code is identical to what we wrote in the first section of this guide.
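Whether the saved hrefs start with a slash depends on the site's HTML, and if they do, f'{root}/{link}' produces a double slash in the URL. As a minimal sketch (the href values below are made up for illustration, not taken from the site), urljoin from Python's standard library handles both cases:

```python
from urllib.parse import urljoin

root = 'https://subslikescript.com'

# Hypothetical hrefs, one with and one without a leading slash
hrefs = ['/movie/Example_Movie-1234567', 'movie/Another_Example-7654321']

# Joining against root plus a trailing slash gives one clean URL per href
full_links = [urljoin(f'{root}/', href) for href in hrefs]

for full_link in full_links:
    print(full_link)
```

Inside the loop, requests.get(urljoin(f'{root}/', link)) could then replace the f-string version.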
Wrapping up
If you want to browse through the web pages, you have two options.

Check any of the pages that are visible on the webpage (for example, 1, 2, 3, or 1757). Get the a tag with the href attribute to collect the links for each page. When you have the links, combine them with the root and proceed as described in Section 2.
Visit page 2 and copy the link you see there. This is how it ought to appear: subslikescript.com/movies?page=2. You can see that the website has a consistent format for each page: f'{website}?page={i}'. If you want to go through the first ten pages, you can reuse the website variable and loop between 1 and 10.
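The second option can be sketched as follows. This only builds the page URLs for the first ten pages; fetching and parsing each one would reuse the steps shown earlier.

```python
root = 'https://subslikescript.com'
website = f'{root}/movies'

# Build the URLs for pages 1 to 10 (range's upper bound is exclusive)
page_urls = [f'{website}?page={i}' for i in range(1, 11)]

for page_url in page_urls:
    print(page_url)
```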

Let's connect on Twitter and on LinkedIn. You can also subscribe to my YouTube channel.
Happy Coding!
 

 Web Scraping in JavaScript – How to Use Puppeteer to Scrape Web Pages 
Gaël Thomas — Tue, 31 Jan 2023 15:26:55 +0000
 Welcome to the world of web scraping! Have you ever needed data from a website but found it hard to access it in a structured format? This is where web scraping comes in.
Using scripts, we can extract the data we need from a website for various purposes, such as creating databases, doing some analytics, and even more.

Disclaimer: Be careful when doing web scraping. Always make sure you're scraping sites that allow it, and performing this activity within ethical and legal limits.

JavaScript and Node.js offer various libraries that make web scraping easier. For simple data extraction, you can use Axios to fetch API responses or a website's HTML.
But if you're looking to do more advanced tasks including automations, you'll need libraries such as Puppeteer, Cheerio, or Nightmare (don't worry the name is nightmare, but it's not that bad to use 😆).
I'll introduce the basics of web scraping in JavaScript and Node.js using Puppeteer in this article. I structured the writing to show you some basics of fetching information on a website and clicking a button (for example, moving to the next page).
At the end of this introduction, I'll recommend ways to practice and learn more by improving the project we just created.
Prerequisites
Before diving in and scraping our first page together using JavaScript, Node.js, and the HTML DOM, I'd recommend having a basic understanding of these technologies. It'll improve your learning and understanding of the topic.
Let's dive in! 🤿
How to Initialize Your First Puppeteer Scraper
New project...new folder! First, create the first-puppeteer-scraper-example folder on your computer. It'll contain the code of our future scraper.
mkdir first-puppeteer-scraper-example

Now, it's time to initialize your Node.js repository with a package.json file. It's helpful to add information to the repository and NPM packages, such as the Puppeteer library.
npm init -y

After typing this command, you should find a package.json file like this one in your repository tree (the puppeteer entry under dependencies only appears once you install the library in the last step).
{
  "name": "first-puppeteer-scraper-example",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^19.6.2"
  },
  "devDependencies": {},
  "description": ""
}

Before proceeding, we must ensure the project is configured to handle ES6 features such as import statements. To do so, you can add the "type": "module" instruction to the configuration.
{
  "name": "first-puppeteer-scraper-example",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^19.6.2"
  },
  "type": "module",
  "description": ""
}

The last step of our scraper initialization is to install the Puppeteer library. Here's how:
npm install puppeteer

Wow! We're there – we're ready to scrape our first website together. 🤩
How to Scrape Your First Piece of Data
In this article, we'll use the ToScrape website as our learning platform. This online sandbox provides two projects specifically designed for web scraping, making it a great starting point to learn the basics such as data extraction and page navigation.
For this beginner's introduction, we'll specifically focus on the Quotes to Scrape website.
How to Initialize the Script
In the project repository root, you can create an index.js file. This will be our application entry point.
To keep it simple, our script consists of one function in charge of getting the website's quotes (getQuotes).
In the function's body, we will need to follow different steps:

Start a Puppeteer session with puppeteer.launch (it'll instantiate a browser variable that we'll use for manipulating the browser)
Open a new page/tab with browser.newPage (it'll instantiate a page variable that we'll use for manipulating the page)
Change the URL of our new page to http://quotes.toscrape.com/ with page.goto

Here's the commented version of the initial script:
import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });
};

// Start the scraping
getQuotes();

What do you think of running our scraper and seeing the output? Let's do it with the command below:
node index.js

After doing this, you should have a brand new browser application started with a new page and the website Quotes to Scrape loaded onto it. Magic, isn't it? 🪄

Quotes to Scrape homepage loaded by our initial script
Note: For this first iteration, we're not closing the browser. This means you will need to close the browser to stop the running application.
How to Fetch the First Quote
Whenever you want to scrape a website, you'll have to play with the HTML DOM. What I recommend is to inspect the page and start navigating the different elements to find what you need.
In our case, we'll follow the baby step principle and start fetching the first quote, author, and text.
After browsing the page HTML, we can notice a quote is encapsulated in a <div> element with a class name quote (class="quote"). This is important information because the scraping works with CSS selectors (for example, .quote).

Browser inspector with the first quote <div> selected

An example of how each quote is rendered in the HTML
Now that we have this knowledge, we can return to our getQuotes function and improve our code to select the first quote and extract its data.
We will need to add the following after the page.goto instruction:

Extract data from our page HTML with page.evaluate (it'll execute the function passed as a parameter in the page context and return the result)
Get the quote HTML node with document.querySelector (it'll fetch the first <div> with the classname quote and return it)
Get the quote text and author from the previously extracted quote HTML node with quote.querySelector (it'll extract the elements with the classname text and author under <div class="quote"> and return them)

Here's the updated version with detailed comments:
import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });

  // Get page data
  const quotes = await page.evaluate(() => {
    // Fetch the first element with class "quote"
    const quote = document.querySelector(".quote");

    // Fetch the sub-elements from the previously fetched quote element
    // Get the displayed text and return it (`.innerText`)
    const text = quote.querySelector(".text").innerText;
    const author = quote.querySelector(".author").innerText;

    return { text, author };
  });

  // Display the quotes
  console.log(quotes);

  // Close the browser
  await browser.close();
};

// Start the scraping
getQuotes();

Something interesting to point out is that the function name for selecting an element is the same as in the browser inspector. Here's an example:

After running the document.querySelector instruction in the browser inspector, we have the first quote as an output (like on Puppeteer)
Let's run our script one more time and see what we have as an output:
{
  text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  author: 'Albert Einstein'
}

We did it! Our first scraped element is here, right in the terminal. Now, let's expand it and fetch all the current page quotes. 🔥
How to Fetch All Current Page Quotes
Now that we know how to fetch one quote, let's trick our code a bit to get all the quotes and extract their data one by one.
Previously we used document.querySelector to select the first matching element (the first quote). To be able to fetch all quotes, we will need the document.querySelectorAll function instead.
We'll need to follow these steps to make it work:

Replace document.querySelector with document.querySelectorAll (it'll fetch all <div> elements with the classname quote and return them)
Convert the fetched elements to a list with Array.from(quoteList) (it'll ensure the list of quotes is iterable)
Move our previous code to get the quote text and author inside the loop and return the result (it'll extract the elements with the classname text and author under <div class="quote"> for each quote)

Here's the code update:
import puppeteer from "puppeteer";

const getQuotes = async () => {
  // Start a Puppeteer session with:
  // - a visible browser (`headless: false` - easier to debug because you'll see the browser in action)
  // - no default viewport (`defaultViewport: null` - website page will be in full width and height)
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // Open a new page
  const page = await browser.newPage();

  // On this new page:
  // - open the "http://quotes.toscrape.com/" website
  // - wait until the dom content is loaded (HTML is ready)
  await page.goto("http://quotes.toscrape.com/", {
    waitUntil: "domcontentloaded",
  });

  // Get page data
  const quotes = await page.evaluate(() => {
    // Fetch all the elements with class "quote"
    // (`querySelectorAll` returns a NodeList of every match)
    const quoteList = document.querySelectorAll(".quote");

    // Convert the quoteList to an iterable array
    // For each quote fetch the text and author
    return Array.from(quoteList).map((quote) => {
      // Fetch the sub-elements from the previously fetched quote element
      // Get the displayed text and return it (`.innerText`)
      const text = quote.querySelector(".text").innerText;
      const author = quote.querySelector(".author").innerText;

      return { text, author };
    });
  });

  // Display the quotes
  console.log(quotes);

  // Close the browser
  await browser.close();
};

// Start the scraping
getQuotes();

As an end result, if we run our script one more time, we should have a list of quotes as an output. Each element of this list should have a text and an author property.
[
  {
    text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
    author: 'Albert Einstein'
  },
  {
    text: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    author: 'J.K. Rowling'
  },
  {
    text: '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
    author: 'Albert Einstein'
  },
  {
    text: '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
    author: 'Jane Austen'
  },
  {
    text: "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
    author: 'Marilyn Monroe'
  },
  {
    text: '“Try not to become a man of success. Rather become a man of value.”',
    author: 'Albert Einstein'
  },
  {
    text: '“It is better to be hated for what you are than to be loved for what you are not.”',
    author: 'André Gide'
  },
  {
    text: "“I have not failed. I've just found 10,000 ways that won't work.”",
    author: 'Thomas A. Edison'
  },
  {
    text: "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
    author: 'Eleanor Roosevelt'
  },
  {
    text: '“A day without sunshine is like, you know, night.”',
    author: 'Steve Martin'
  }
]

Good job! All the quotes from the first page are now scraped by our script. 👏
How to Move to the Next Page
Our script is now able to fetch all the quotes for one page. What would be interesting is clicking on the "Next page" at the page bottom and doing the same on the second page.

"Next" button at the Quotes to Scrape page bottom
Let's go back to our browser inspector and find how we can target this element using CSS selectors.
As we can notice, the next button is placed under an unordered list <ul> with a pager classname (<ul class="pager">). This list has an element <li> with a next classname (<li class="next">). Finally, there is a link anchor <a> that links to the second page (<a href="/page/2/">).
In CSS, if we want to target this specific link there are different ways to do that. We can do:

.next > a: but it's risky, because if there is another element with .next as a parent element containing a link, it'll click on it.
.pager > .next > a: safer, because we make sure the link should be inside the .pager parent element under the .next element. There is a low risk of having this hierarchy more than once.


An example of how the "Next" button is rendered in the HTML
To click this button, at the end of our script after the console.log(quotes);, you can add the following: await page.click(".pager > .next > a");.
Since we're now closing the browser with await browser.close(); after all instructions are done, you need to comment out this instruction to see the second page opened in the scraper browser.
It's temporary and for testing purposes, but the end of our getQuotes function should look like this:
  // Display the quotes
  console.log(quotes);

  // Click on the "Next page" button
  await page.click(".pager > .next > a");

  // Close the browser
  // await browser.close();

After this, if you run our scraper again, after processing all instructions, your browser should stop on the second page:

Quotes to Scrape second page loaded after clicking the "Next" button
It’s Your Time! Here’s What You Can Do Next:
Congrats on reaching the end of this introduction to scraping with Puppeteer! 👏
Now it's your turn to improve the scraper and make it get more data from the Quotes to Scrape website. Here's a list of potential improvements you can make:

Navigate between all pages using the "Next" button and fetch the quotes on all the pages.
Fetch the quote's tags (each quote has a list of tags).
Scrape the author's about page (by clicking on the author's name on each quote).
Categorize the quotes by tags or authors (it's not 100% related to the scraping itself, but that can be a good improvement).

Feel free to be creative and do any other things you see fit 🚀
Scraper Code Is Available on GitHub
Check out the latest version of our scraper on GitHub! You're free to save, fork, or utilize it as you see fit.
=> First Puppeteer Scraper (example)
Successful Scraping Start: Thanks for reading the article!
I hope this article gave you a valuable introduction to web scraping using JavaScript and Puppeteer. Writing this was a pleasure, and I hope you found it informative and enjoyable.
Join me on Twitter for more content like this. I regularly share content to help you grow your web development skills and would love to have you join the conversation. Let's learn, grow, and inspire each other along the way!
 

 How to Use Python to Scrape App Store Reviews 
freeCodeCamp — Fri, 16 Sep 2022 18:04:19 +0000
 By Shittu Olumide
Data scraping, commonly referred to as web scraping, is a technique for getting data and content from the internet.
You usually keep this information in a local file so that you can change and inspect it as needed. 
On a very small scale, web scraping is basically just copying and pasting content from a website into an Excel spreadsheet.
The main goal of this article is to help you get started in web scraping using quick and easy steps. You will learn how to scrape app store reviews using the app_store_scraper library in Python. There are other tools and libraries you can use, such as Scrapy, Pandas, and BeautifulSoup, but here we will use app_store_scraper.
Depending on the mechanism you select for web scraping, it might be either really simple or quite complex. 
Fortunately, there is straightforward and excellent software that can help you gather reviews about your app from the Apple app store and use them for further sentiment analysis.
Why is web scraping even useful?
Data analytics professionals employ web scraping for a variety of tasks, including lead creation, market analysis, consumer sentiment analysis, and data integration.
You can also use web scraping to track stock prices, online opportunities (such as scholarships, employment, internships, and so on), competitors' inventory data, and customer reviews and ratings.  
In this article, you will learn how to use Python to scrape app store reviews in 4 easy steps. 
Before you start, here's something to keep in mind: some sites don't allow you to scrape their content, so be sure you check before doing so. Web scraping isn't precisely forbidden, but you should take care to know when/where you can scrape. I strongly recommend that you scrape for informational and educational purposes only.
Step 1 – Install and Setup Packages
First, you have to install and set up the necessary packages. In this step you will install app_store_scraper using the Python package installer.
pip install app_store_scraper 

#or

pip3 install app_store_scraper

Step 2 – Get App's Name and ID
I will be using a random app and scraping its reviews for the sake of this demo. But if you have a personal app that you built and you have it on the App Store, you can use that app with these same techniques. You just need to get the app's name and ID, which you can find by typing the name of the app into Google using your PC.
Example: "Slack app on apple app store"

You should click on the first result which will redirect you to the official Apple store. There you will find the "Slack app" and everything about it. 
Once the page loads in the URL you will see the app name (Slack) and app ID (618783545). Copy it down in your notepad.
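If you'd rather not copy the values by hand, you can pull them out of the URL with a regular expression. This is only a sketch: the URL shape below (country code, then app, then the app name, then id followed by digits) is an assumption based on how App Store links commonly look, so check it against the address you actually see in your browser.

```python
import re

# Assumed URL shape for an App Store page; adjust it to the address you see
url = 'https://apps.apple.com/us/app/slack/id618783545'

pattern = re.compile(
    r'apps\.apple\.com/(?P<country>[a-z]{2})/app/(?P<app_name>[^/]+)/id(?P<app_id>\d+)'
)

match = pattern.search(url)
if match is None:
    raise ValueError(f'Unrecognized App Store URL: {url}')

# The three named groups line up with the arguments AppStore expects
app_info = match.groupdict()
print(app_info)
```

The resulting dictionary has the same keys as the country, app_name, and app_id arguments used in the next snippet.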

Now you'll need to import some packages and run some code:
import pandas as pd
import numpy as np
import json

from app_store_scraper import AppStore
slack = AppStore(country='us', app_name='slack', app_id = '618783545')

slack.review(how_many=2000)

In the code above, you will import the pandas library which helps you add evaluations/reviews to a dataframe. You'll also import the numpy library for data transformation and modification. Finally, you'll get the app_store_scraper package itself for scraping the reviews from the website. 
You will have to create an instance of the AppStore class, then pass in the arguments country, app_name, and app_id.

slack app ratings
The reviews are all stored in the slack variable, so run the command below to see the reviews stored in JSON format.
slack.reviews


slack app scraped reviews
Step 3 – Convert Data from JSON
To make data more readable and properly formatted, you need to convert it from JSON format to a Pandas dataframe. You can do that with the following code:
slackdf = pd.DataFrame(np.array(slack.reviews),columns=['review'])
slackdf2 = slackdf.join(pd.DataFrame(slackdf.pop('review').tolist()))
slackdf2.head()


generated reviews in pandas dataframe
Step 4 – Convert the Dataframe to CSV
Here is the final step: you will covert the dataframe into CSV (comma-separated value) format so that you can have it on your local machine. Then you can view it in a spreadsheet and also share it with a colleague.
slackdf2.to_csv('Slack-app-reviews.csv')

Finally, you should have your "Slack-app-reviews.csv" file saved into your project folder and you're ready to go. 
Conclusion
In this short article, you were able to scrape Slack app store reviews into a dataframe and then save it into your local machine using 4 easy steps. I hope you enjoyed it, cheers.
Here is the GitHub repo where I hosted the code, feel free to star the repository.
 

 Web Scraping in Python – How to Scrape Sci-Fi Movies from IMDB 
freeCodeCamp — Tue, 09 Aug 2022 19:50:10 +0000
 By Riley Predum
Have you ever struggled to find a dataset for your data science project? If you're like I am, the answer is yes. 
Luckily, there are many free datasets available – but sometimes you want something more specific or bespoke. For that, web scraping is a good skill to have in your toolbox to pull data off your favorite website.
What’s Covered in this Article?
This article has a Python script you can use to scrape the data on sci-fi movies (or whatever genre you choose!) from the IMDB website. It can then write these data to a dataframe for further exploration. 
I will conclude this article with a bit of exploratory data analysis (EDA). Through this, you will see what further data science projects are possible for you to try.
Disclaimer: while web scraping is a great way of programmatically pulling data off of websites, please do so responsibly. My script uses the sleep function, for example, to slow down the pull requests intentionally, so as not to overload IMDB's servers. Some websites frown upon the use of web scrapers, so use it wisely.
Web Scraping and Data Cleaning Script
Let’s get to the scraping script and get that running. The script pulls in movie titles, years, ratings (PG-13, R, and so on), genres, runtimes, reviews, and votes for each movie. You can choose how many pages you want to scrape based on your data needs. 
Note: it will take longer the more pages you select. It takes 40 min to scrape 200 webpages using the Google Colab Notebook.
For those of you who have not tried it before, Google Colab is a cloud-based Jupyter Notebook style Python development tool that lives in the Google app suite. You can use it out of the box with many of the packages already installed that are common in data science. 
Below is an image of the Colab workspace and its layout:

Introducing the Google Colab user interface
With that, let’s dive in! First things first, you should always import your packages as their own cell. If you forget a package you can re-run just that cell. This cuts down on development time. 
Note: some of these packages need pip install package_name to be run to install them first. If you choose to run the code locally using something like a Jupyter Notebook you'll need to do that. If you want to get up and running quickly, you can use the Google Colab notebook. This has all these installed by default.
from requests import get
from bs4 import BeautifulSoup
from warnings import warn
from time import sleep
from random import randint
import numpy as np, pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

How to Do the Web Scraping
You can run the following code which does the actual web scraping. It will pull all the columns mentioned above into arrays and populate them one movie at a time, one page at a time. 
There are also some data cleaning steps I have added and documented in this code as well. I removed parentheses from string data mentioning the year of the film for example. I then converted those to integers. Things like this make exploratory data analysis and modeling easier.
Note that I use the sleep function to avoid being restricted by IMDB when it comes to cycling through their web pages too quickly.
# Note: this takes about 40 min to run if np.arange uses 9951 as the stopping point.

# Last time I tried, I could only go up to 10000 items because after that the URI has no discernible pattern (to combat web crawlers).
# For a quick demonstration, lower the stopping point (for example, 200 scrapes just 4 pages). You can increase it again for your own projects.
pages = np.arange(1, 9951, 50)
headers = {'Accept-Language': 'en-US,en;q=0.8'} # If this is not specified, the default language is Mandarin

#initialize empty lists to store the variables scraped
titles = []
years = []
ratings = []
genres = []
runtimes = []
imdb_ratings = []
imdb_ratings_standardized = []
metascores = []
votes = []

for page in pages:

   #get request for sci-fi
   response = get("https://www.imdb.com/search/title?genres=sci-fi&"
                  + "start="
                  + str(page)
                  + "&explore=title_type,genres&ref_=adv_prv", headers=headers)

   sleep(randint(8,15))

   #throw warning for status codes that are not 200
   if response.status_code != 200:
       warn('Request: {}; Status code: {}'.format(page, response.status_code))

   #parse the content of current iteration of request
   page_html = BeautifulSoup(response.text, 'html.parser')

   movie_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

   #extract the 50 movies for that page
   for container in movie_containers:

       #conditional for all with metascore
       if container.find('div', class_ = 'ratings-metascore') is not None:

           #title
           title = container.h3.a.text
           titles.append(title)

           if container.h3.find('span', class_= 'lister-item-year text-muted unbold') is not None:

             #year released
             year = container.h3.find('span', class_= 'lister-item-year text-muted unbold').text # remove the parentheses around the year and make it an integer
             years.append(year)

           else:
             years.append(None) # each of the additional if/else clauses handles missing data by appending a placeholder (None or an empty string) so the lists are all the same length at the end of the scraping

           if container.p.find('span', class_ = 'certificate') is not None:

             #rating
             rating = container.p.find('span', class_= 'certificate').text
             ratings.append(rating)

           else:
             ratings.append("")

           if container.p.find('span', class_ = 'genre') is not None:

             #genre
             genre = container.p.find('span', class_ = 'genre').text.replace("\n", "").rstrip().split(',') # remove the whitespace character, strip, and split to create an array of genres
             genres.append(genre)

           else:
             genres.append("")

           if container.p.find('span', class_ = 'runtime') is not None:

             #runtime
             time = int(container.p.find('span', class_ = 'runtime').text.replace(" min", "")) # remove the minute word from the runtime and make it an integer
             runtimes.append(time)

           else:
             runtimes.append(None)

           if container.strong is not None:

             #IMDB ratings
             imdb = float(container.strong.text) # non-standardized variable
             imdb_ratings.append(imdb)

           else:
             imdb_ratings.append(None)

           if container.find('span', class_ = 'metascore') is not None:

             #Metascore
             m_score = int(container.find('span', class_ = 'metascore').text) # make it an integer
             metascores.append(m_score)

           else:
             metascores.append(None)

           if container.find('span', attrs = {'name':'nv'}) is not None:

             #Number of votes
             vote = int(container.find('span', attrs = {'name':'nv'})['data-value'])
             votes.append(vote)

           else:
             votes.append(None)
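
The start offsets used in the loop above can be sketched without NumPy or any network access. This simplified version keeps only the genres and start query parameters (the extra explore and ref_ parameters from the original request are left out), and build_page_urls is a hypothetical helper name, not part of the original script:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.imdb.com/search/title'

def build_page_urls(stop, step=50, genre='sci-fi'):
    """Return one search URL per page; each page starts 50 items after the last."""
    urls = []
    for start in range(1, stop, step):
        query = urlencode({'genres': genre, 'start': start})
        urls.append(f'{BASE_URL}?{query}')
    return urls

# A short demonstration run: offsets 1, 51, 101, and 151
demo_urls = build_page_urls(stop=200)

# The full run from the script: offsets 1, 51, ..., 9901
full_urls = build_page_urls(stop=9951)

print(len(demo_urls), len(full_urls))
print(demo_urls[0])
```

Printing the list lengths before starting a long scrape is a cheap way to confirm how many requests the loop will make.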

Pandas dataframes take as input arrays of data for each of their columns in key:value pairs. I did a couple extra data cleaning steps here to finalize the data cleaning. 
After you run the following cell, you should have a dataframe with the data you scraped.
sci_fi_df = pd.DataFrame({'movie': titles,
                      'year': years,
                      'rating': ratings,
                      'genre': genres,
                      'runtime_min': runtimes,
                      'imdb': imdb_ratings,
                      'metascore': metascores,
                      'votes': votes}
                      )

# Two more data transformations after scraping
# Keep only the four characters before the closing parenthesis, for example "(2019)" becomes "2019"
sci_fi_df.loc[:, 'year'] = sci_fi_df['year'].str[-5:-1]
# Put the IMDB rating on a 0-100 scale
sci_fi_df['n_imdb'] = sci_fi_df['imdb'] * 10
# Drop the 'ovie' bug: a year text ending in "Movie)" slices to 'ovie'. This affected two movies, so just drop those rows.
final_df = sci_fi_df.loc[sci_fi_df['year'] != 'ovie'].copy()
# Make year numeric
final_df['year'] = pd.to_numeric(final_df['year'])
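
To see where the 'ovie' value comes from, here is a small sketch using made-up year strings (the exact text IMDB returns is an assumption here). It compares the slice used above with a regular expression that looks for four consecutive digits, which is one possible alternative rather than what the original script does:

```python
import re

# Hypothetical raw values in the style of the scraped year text
raw_years = ['(2019)', '(I) (2014)', '(2016 TV Movie)', None]

def slice_year(raw):
    """The approach from the script: the four characters before the last one."""
    return raw[-5:-1]

def extract_year(raw):
    """Alternative: find the first run of four digits, or return None."""
    if raw is None:
        return None
    match = re.search(r'\d{4}', raw)
    return int(match.group()) if match else None

sliced = [slice_year(raw) for raw in raw_years if raw is not None]
extracted = [extract_year(raw) for raw in raw_years]

print(sliced)     # ['2019', '2014', 'ovie']
print(extracted)  # [2019, 2014, 2016, None]
```

With the regular expression, the rows that currently get dropped would keep their year instead.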

Exploratory Data Analysis
Now that you have the data, one of the first things you might want to do is learn more about it at a high level. The following commands are a useful first look at any data and we’ll use them next:
final_df.head()

This command shows you the first 5 rows of your dataframe. It helps you see that nothing looks weird and everything is ready for analysis. You can see the output here:

The first five rows of data outputted using the final_df.head() command
final_df.describe()

This command will provide you with the mean, standard deviation, and other summaries. Count can show you if there are any null values in some of the columns which is useful information to know. The year column, for example, shows you the range of movies scraped – from 1927 to 2022. 
You can see the output below and inspect the others:

Running final_df.describe() produces summary statistics showing the number of data points, averages, standard deviations, and more.
final_df.info()

This command lets you know the data types you are working with in each of your columns. 
As a data scientist, this information can be helpful to you. Certain functions and methods need certain data types. You can also ensure that your underlying data types are in a format that makes sense for what they are. 
For example: a 5 star rating should be a float or int (if decimals are not allowed). It should not be a string since it's a number. Here’s a summary of what the data format was for each variable after the scraping:

Running final_df.info() shows you how many values you have in each column and what their data types are.
The next command to learn more about your variables produces a heatmap. The heatmap shows the correlation between all your quantitative variables. This is a quick way to assess relationships that may exist between variables. I like to see the coefficients rather than trying to decipher the color code, so I use the annot=True argument.
sns.heatmap(final_df.corr(numeric_only=True), annot=True);

The command above produces the following visualization using the Seaborn data visualization package:

A heatmap of correlations produced with sns.heatmap and annot=True
You can see that the strongest correlation is between the IMDB score and the metascore. This is not surprising since it's likely that two movie rating systems rate similarly.
The next strongest correlation you can see is between the IMDB rating and the number of votes. This is interesting because as the number of votes increases, you have a more representative sample of the population rating. It's strange to see that there is a weak association between the two, though.
The number of votes roughly increases as the runtime increases as well.
You can also see a slight negative association between IMDB or metascore and the year the movie came out. We’ll look at this shortly.
You can check out some of these relationships visually via a scatter plot with this code:
x = final_df['n_imdb']
y = final_df['votes']
plt.scatter(x, y, alpha=0.5) # s= is size var, c= is color var
plt.xlabel("IMDB Rating Standardized")
plt.ylabel("Number of Votes")
plt.title("Number of Votes vs. IMDB Rating")
plt.ticklabel_format(style='plain')
plt.show()

This results in this visualization:

IMDB Ratings vs. the Number of Votes
The association above shows some outliers. Generally, we see a greater number of votes on movies that have an IMDB rating of 85 or more. There are fewer reviews on movies with a rating of 75 or less. 
Drawing these boxes around the data can show you what I mean. There are roughly two groupings of different magnitudes:

Two Core Groups in the Data
Another thing that might be interesting to see is how many movies of each rating there are. This can show you where Sci-Fi tends to land in the ratings data. Use this code to get a bar chart of the ratings:
ax = final_df['rating'].value_counts().plot(kind='bar',
                                   figsize=(14,8),
                                   title="Number of Movies by Rating")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of Movies")
ax.plot();

That code results in this chart which shows us that R and PG-13 make up the majority of these Sci-Fi movies on IMDB.

Number of Movies by Rating
I did see that there were a few movies rated as “Approved” and was curious what that was. You can filter down the dataframe with this code to drill down into that:
final_df[final_df['rating'] == 'Approved']

This revealed that most of these movies were made before the 80s:

All Rating "Approved" Movies
I went to the MPAA website and there was no mention of them on their ratings information page. It must have been phased out at some point.
You could also check out if any years or decades outperformed others on reviews. I took the average metascore by year and plotted that with the following code to explore further:
# What are the average metascores by year?
final_df.groupby('year')['metascore'].mean().plot(kind='bar', figsize=(16,8), title="Avg. Metascore by Year", xlabel="Year", ylabel="Avg. Metascore")
plt.xticks(rotation=90)
plt.plot();

This results in the following chart:

Avg. Metascore by Movie Year
Now I’m not saying I know why, but there is a gradual, mild decline as you progress through history in the average metascore variable. It seems that ratings have leveled out around 55-60 in the last couple decades. This might be because we have more data on newer movies or newer movies tend to get reviewed more.
final_df['year'].value_counts().plot(kind='bar', figsize=[20,9])

Run the above code and you will see that 1927 is represented by a single movie, so its average rests on a sample of 1. That score is then biased and over-inflated. You will also see that more recent years are better represented, as I suspected:

Number of Movies by Year
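One way to keep single-movie years from skewing the picture is to track the sample size alongside the average and drop years that fall below a threshold. The sketch below uses plain Python on a few made-up (year, metascore) pairs so it runs without the scraped dataframe. The function name mean_by_year and the min_count cutoff are assumptions for illustration, not part of the original analysis.

```python
from collections import defaultdict
from statistics import mean


def mean_by_year(rows, min_count=2):
    """Average score per year, skipping years with fewer than min_count rows."""
    scores = defaultdict(list)
    for year, score in rows:
        scores[year].append(score)
    return {
        year: mean(values)
        for year, values in sorted(scores.items())
        if len(values) >= min_count
    }


# Made-up sample data: 1927 has a single movie, so it is filtered out.
sample = [(1927, 98), (2019, 60), (2019, 50), (2021, 70), (2021, 40)]
averages = mean_by_year(sample)
print(averages)
```

Raising min_count trades coverage of older years for more stable averages, so it is worth trying a few values before drawing conclusions.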
Data Science Project Ideas to Take This Further
You have textual, categorical, and numeric variables here. There are a few options you could try to explore more.
One thing you could do is use Natural Language Processing (NLP) to see if there are any naming conventions tied to movie ratings, or within the world of Sci-Fi (or whatever genre you chose, if you picked a different one).
You could change the web scraping code to pull in many more genres too. With that you could create a new inter-genre database to see if there are naming conventions by genre.
You could then try to predict the genre based on the name of the movie. You could also try to predict the IMDB rating based on the genre or year the movie came out. The latter idea would work better in the last few decades since most observations are there.
I hope this tutorial sparked curiosity in you about the world of data science and what's possible! 
You’ll find in exploratory data analysis that there are always more questions to ask. Working with that constraint is about prioritizing based on business goal(s). It’s important to start with those objectives up front or you could be in the data weeds exploring forever.
If the field of data science is interesting to you and you want to expand your skillset and enter into it professionally, consider checking out Springboard’s Data Science Career Track. In this course, Springboard guides you through all of the key concepts in depth with a 1:1 expert mentor paired with you to support you on your journey.
I've written other articles that frame data science projects in relation to business problems and walk through technical approaches to solving them on my Medium. Check those out if you are interested!
Happy coding!
Riley
 

 Web Scraping with Python – How to Scrape Data from Twitter using Tweepy and Snscrape 
Ibrahim Ogunbiyi — Tue, 12 Jul 2022 17:58:29 +0000
 If you are a data enthusiast, you'll likely agree that one of the richest sources of real-world data is social media. Sites like Twitter are full of data.
You can use the data you can get from social media in a number of ways, like sentiment analysis (analyzing people's thoughts) on a specific issue or field of interest.
There are several ways you can scrape (or gather) data from Twitter. And in this article, we will look at two of those ways: using Tweepy and Snscrape.
We will learn a method to scrape public conversations from people on a specific trending topic, as well as tweets from a particular user.
Now without further ado, let’s get started.
Tweepy vs Snscrape – Introduction to Our Scraping Tools
Now, before we get into the implementation of each tool, let's try to grasp the differences between them and the limits of each.
Tweepy
Tweepy is a Python library for integrating with the Twitter API. Because Tweepy is connected with the Twitter API, you can perform complex queries in addition to scraping tweets. It enables you to take advantage of all of the Twitter API's capabilities.
But there are some drawbacks – like the fact that its standard API only allows you to collect tweets for up to a week (that is, Tweepy does not allow recovery of tweets beyond a week window, so historical data retrieval is not permitted).
Also, there are limits to how many tweets you can retrieve from a user's account. You can read more about Tweepy's functionalities here.
Snscrape
Snscrape is another approach for scraping information from Twitter that does not require the use of an API. Snscrape allows you to scrape basic information such as a user's profile, tweet content, source, and so on.
Snscrape is not limited to Twitter, but can also scrape content from other prominent social media networks like Facebook, Instagram, and others.
Its advantages are that there are no limits to the number of tweets you can retrieve or the window of tweets (that is, the date range of tweets). So Snscrape allows you to retrieve old data.
But the one disadvantage is that it lacks all the other functionalities of Tweepy – still, if you only want to scrape tweets, Snscrape would be enough.
Now that we've clarified the distinction between the two methods, let's go over their implementation one by one.
How to Use Tweepy to Scrape Tweets
Before we begin using Tweepy, we must first make sure that our Twitter credentials are ready. With that, we can connect Tweepy to our API key and begin scraping.
If you do not have Twitter credentials, you can register for a Twitter developer account by going here. You will be asked some basic questions about how you intend to use the Twitter API. After that, you can begin the implementation.
The first step is to install the Tweepy library on your local machine, which you can do by typing:
pip install git+https://github.com/tweepy/tweepy.git

How to Scrape Tweets from a User on Twitter
Now that we’ve installed the Tweepy library, let’s scrape 100 tweets from a user called john on Twitter. We'll look at the full code implementation that will let us do this and discuss it in detail so we can grasp what’s going on:
import pandas as pd
import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


username = "john"
no_of_tweets = 100


try:
    #The number of tweets we want to retrieve from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)

    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except Exception as e:
    print('Status Failed On,',str(e))

Now let's go over each part of the code in the above block.
import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)

In the above code, we imported the Tweepy library and created four variables to store our Twitter credentials, since the Tweepy authentication handler requires all four. We then pass those variables into the authentication handler and save the result in another variable.
The last statement instantiates the Tweepy API with that handler. Setting wait_on_rate_limit=True tells Tweepy to wait when we hit a rate limit instead of failing.
username = "john"
no_of_tweets = 100


try:
    #The number of tweets we want to retrieve from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)

    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except Exception as e:
    print('Status Failed On,',str(e))

In the above code, we set the name of the user (the @name on Twitter) whose tweets we want to retrieve, along with the number of tweets. We then wrapped the request in a try/except block so that any error is caught and printed instead of crashing the script.
After that, api.user_timeline() returns a collection of the most recent tweets posted by the user named in the screen_name parameter, up to the number given in count.
In the next line of code, we picked the attributes we want from each tweet and saved them in a list. To see more attributes you can retrieve from a tweet, read this.
In the last chunk of code, we created a dataframe from that list, along with the column names we defined.
Note that the column names must be in the same order as the attributes in the attributes container (that is, the order in which you listed those attributes when retrieving them from each tweet).
If you correctly followed the steps I described, you should have something like this:

Image by Author
Now that we are done, let's go over one more example before we move into the Snscrape implementation.
How to Scrape Tweets from a Text Search
In this method, we will retrieve tweets based on a search. You can do that like this:
import pandas as pd
import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


search_query = "sex for grades"
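no_of_tweets = 150


try:
    #A single search request returns at most 100 tweets, so tweepy.Cursor pages through api.search_tweets until we have the number we want
    tweets = tweepy.Cursor(api.search_tweets, q=search_query, count=100).items(no_of_tweets)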

    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except Exception as e:
    print('Status Failed On,',str(e))

The above code is similar to the previous code, except that we changed the API method from api.user_timeline() to api.search_tweets(). We've also added tweet.user.name to the attributes container list.
Notice that tweet.user.name chains two attributes. On its own, tweet.user returns the whole user object, so we also have to name the attribute we want from that object, which is name.
You can go here to see a list of additional attributes that you can retrieve from a user object. Now you should see something like this once you run it:

Image by Author.
Alright, that just about wraps up the Tweepy implementation. Just remember that there is a limit to the number of tweets you can retrieve, and you can not retrieve tweets more than 7 days old using Tweepy.
How to Use Snscrape to Scrape Tweets
As I mentioned previously, Snscrape does not require Twitter credentials (API key) to access it. There is also no limit to the number of tweets you can fetch.
For this example, though, we'll just retrieve the same tweets as in the previous example, but using Snscrape instead.
To use Snscrape, we must first install its library on our PC. You can do that by typing:
pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

How to Scrape Tweets from a User with Snscrape
Snscrape includes two methods for getting tweets from Twitter: the command line interface (CLI) and a Python Wrapper. Just keep in mind that the Python Wrapper is currently undocumented – but we can still get by with trial and error.
In this example, we will use the Python Wrapper because it is more intuitive than the CLI method. But if you get stuck with some code, you can always turn to the GitHub community for assistance. The contributors will be happy to help you.
To retrieve tweets from a particular user, we can do the following:
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Created a list to append all tweet attributes(data)
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i >= 100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

Let's go over some of the code that you might not understand at first glance:
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i >= 100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])


# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

In the above code, sntwitter.TwitterSearchScraper returns an object that yields the tweets from the user we passed into it (which is john).
As I mentioned earlier, Snscrape does not limit the number of tweets, so it will keep returning tweets from that user for as long as there are any. To cap this, we wrap the object in enumerate, which adds a counter as we iterate, so we can stop once we have the most recent 100 tweets from the user.
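Another way to cap a generator like this is itertools.islice from the standard library, which stops after a fixed number of items without a manual counter. The sketch below uses a stand-in generator (fake_items, a hypothetical name) in place of the scraper's get_items() so that it runs offline.

```python
from itertools import islice


def fake_items():
    """Stand-in for a scraper's get_items(): yields items without end."""
    n = 0
    while True:
        yield f"tweet {n}"
        n += 1


# Take exactly the first 100 items, then stop pulling from the generator.
first_100 = list(islice(fake_items(), 100))
print(len(first_100), first_100[0], first_100[-1])
```

With the real scraper, the same pattern would wrap the get_items() call in islice in place of the enumerate-and-break loop.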
You can see that the attribute syntax for each tweet looks like the one from Tweepy. This is the list of attributes we can get from a Snscrape tweet, as curated by Martin Beck.

Credit: Martin Beck
More attributes might be added, as the Snscrape library is still in development. For instance, in the above image, source has been replaced with sourceLabel. If you pass in only source, it will return an object.
If you run the above code, you should see something like this as well:

Image by Author
Now let's do the same for scraping by search.
How to Scrape Tweets from a Text Search with Snscrape
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('sex for grades since:2021-07-05 until:2022-07-06').get_items()):
    if i >= 150:
        break
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

# Creating a dataframe to load the list
tweets_df = pd.DataFrame(attributes_container, columns=["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"])

Again, you can access a lot of historical data using Snscrape (unlike Tweepy, whose standard API only reaches back 7 days and whose premium API reaches back 30 days). So we can pass the date we want the search to start from and the date we want it to end into the sntwitter.TwitterSearchScraper() method.
What we've done in the preceding code is basically what we discussed before. The only thing to bear in mind is that until works similarly to the range function in Python (that is, it excludes the last integer). So if you want to get tweets from today, you need to include the day after today in the "until" parameter.

Image by Author.
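Because until excludes its own date, it can help to build the query string from the last day you actually want and let the code add the extra day. Here is a minimal sketch of that idea using only the standard library. The helper name build_query is an assumption for illustration, and the resulting string is what you would pass to sntwitter.TwitterSearchScraper().

```python
from datetime import date, timedelta


def build_query(term, start, end_inclusive):
    """Build a search string whose date range includes end_inclusive."""
    until = end_inclusive + timedelta(days=1)  # until: is exclusive, so add a day
    return f"{term} since:{start.isoformat()} until:{until.isoformat()}"


query = build_query("python", date(2022, 7, 1), date(2022, 7, 5))
print(query)
```

Using timedelta rather than editing the string by hand also handles month and year boundaries correctly.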
Now you know how to scrape tweets with Snscrape, too!
When to use each approach
Now that we've seen how each method works, you might be wondering when to use which.
Well, there is no universal rule for when to use each method. Everything comes down to a matter of preference and your use case.
If you want to acquire an endless number of tweets, you should use Snscrape. But if you want to use extra features that Snscrape cannot provide (like geolocation, for example), then you should definitely use Tweepy. It is directly integrated with the Twitter API and provides complete functionality.
Even so, Snscrape is the most commonly used method for basic scraping.
Conclusion
In this article, we learned how to scrape data from Twitter with Python using Tweepy and Snscrape. But this was only a brief overview of how each approach works. You can learn more by exploring the web for additional information.
I've included some useful resources that you can use if you need additional information. Thank you for reading.
https://github.com/JustAnotherArchivist/snscrape
 
https://docs.tweepy.org/en/stable/index.html
 
https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af
 

 Python Project – How to Create a Horoscope API with Beautiful Soup and Flask 
Ashutosh Krishna — Fri, 17 Dec 2021 18:29:53 +0000
 Have you ever read your horoscope in the newspaper or seen it on television? Well, I'm not sure about other countries, but in my country of India, people still read their horoscopes. 
And this is where I got the idea for this tutorial. It might sound a bit old-fashioned, but the main focus here is not on the horoscope itself – it's just the vehicle for our learning. 
In this article, we're going to scrape a website called Horoscope.com using Beautiful Soup and then create our own API using Flask. This API, if deployed on a public server, can then be used by other developers who would wish to create a website to show their horoscope or an app for the same.
How to Set Up the Project
First of all, we're going to create a virtual environment within which we'll install all the required dependencies. 
Python now ships with the pre-installed venv library. So, to create a virtual environment, you can use the below command:
$ python -m venv env

To activate the virtual environment named env, use the command:

On Windows:

env\Scripts\activate.bat


On Linux and MacOS:

source env/bin/activate

To deactivate the environment (not required at this stage):
deactivate

Now we're ready to install the dependencies. The modules and libraries we are going to use in this project are:

    requests: Requests allows you to send HTTP/1.1 requests extremely easily. The module doesn't come pre-installed with Python, so we need to install it using the command:

    $ pip install requests
    
    bs4: Beautiful Soup(bs4) is a Python library for pulling data out of HTML and XML files. The module doesn't come pre-installed with Python, so we need to install it using the command:
    $ pip install bs4
    
    Flask: Flask is a simple, easy-to-use microframework for Python that can help build scalable and secure web applications. The module doesn't come pre-installed with Python, so we need to install it using the command:
    $ pip install flask
    
    Flask-RESTX: Flask-RESTX lets you create APIs with Swagger Documentation. The module doesn't come pre-installed with Python, so we need to install it using the command:
    $ pip install flask-restx
    


We'll also use environment variables in this project. So, we are going to install another module called python-decouple to handle this:
pip install python-decouple

To learn more about environment variables in Python, you can check out this article.
Project Workflow
The basic workflow of the project will be like this:

The horoscope data will be scraped from Horoscope.com.
The data will then be used by our Flask server to send a JSON response to the user.

How to Set Up a Flask Project
The first thing we're going to do is to create a Flask project. If you check the official documentation of Flask, you'll find a minimal application there. 
But, we're not going to follow that. We are going to write an application that is more extensible and has a good base structure. If you wish, you can follow this guide to get started with Flask.
Our application will exist within a package called core. To convert a usual directory to a Python package, we just need to include an __init__.py file. So, let's create our core package first.
$ mkdir core

After that, let's create the __init__.py file inside the core directory:
$ cd core
$ touch __init__.py
$ cd ..

In the root directory of the project, create a file called config.py. We'll store the configurations for the project in this file. Within the file, add the following content:
from decouple import config


class Config(object):
    SECRET_KEY = config('SECRET_KEY', default='guess-me')
    DEBUG = False
    TESTING = False
    CSRF_ENABLED = True


class ProductionConfig(Config):
    DEBUG = False
    MAIL_DEBUG = False


class StagingConfig(Config):
    DEVELOPMENT = True
    DEBUG = True


class DevelopmentConfig(Config):
    DEVELOPMENT = True
    DEBUG = True


class TestingConfig(Config):
    TESTING = True

In the above script, we have created a Config class and defined various attributes inside that. Also, we have created different child classes (as per different stages of development) that inherit the Config class.
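To see how the inheritance plays out, here is a small self-contained sketch with a trimmed-down version of two of those classes. The settings helper is hypothetical: it only mimics the idea that Flask's from_object reads the uppercase attributes of the object it is given.

```python
class Config:
    DEBUG = False
    TESTING = False
    CSRF_ENABLED = True


class DevelopmentConfig(Config):
    DEBUG = True  # overrides the base value; everything else is inherited


def settings(cls):
    """Collect the uppercase attributes of a config class into a dict."""
    return {name: getattr(cls, name) for name in dir(cls) if name.isupper()}


dev = settings(DevelopmentConfig)
print(dev)
```

Only DEBUG changes in the child class, so each stage has to declare just the values that differ from the base configuration.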
Notice that we have the SECRET_KEY set to an environment variable named SECRET_KEY. Create a file named .env in the root directory and add the following content there:
APP_SETTINGS=config.DevelopmentConfig
SECRET_KEY=gufldksfjsdf

Apart from SECRET_KEY, we have APP_SETTINGS that refers to one of the classes we created in the config.py file. We set it to the current stage of the project.
Now, we can add the following content in the __init__.py file:
from flask import Flask
from decouple import config
from flask_restx import Api

app = Flask(__name__)
app.config.from_object(config("APP_SETTINGS"))
api = Api(
    app,
    version='1.0',
    title='Horoscope API',
    description='Get horoscope data easily using the below APIs',
    license='MIT',
    contact='Ashutosh Krishna',
    contact_url='https://ashutoshkrris.tk',
    contact_email='contact@ashutoshkrris.tk',
    doc='/',
    prefix='/api/v1'
)

In the above Python script, we first import the Flask class from the flask module that we installed. Next, we create an object app of class Flask. We use the __name__ argument to indicate the app's module or package, so that Flask knows where to find other files such as templates.
Next, we load the app configuration from the class named by the APP_SETTINGS variable in the .env file.
Apart from that, we have created an object of the Api class. We need to pass various arguments to it. We can find the Swagger documentation on the / route. The /api/v1 will be prefixed on each API route. 
For now, let's create a routes.py file in the core package and just add the following namespace:
from core import api
from flask import jsonify

ns = api.namespace('/', description='Horoscope APIs')

We need to import the routes in the __init__.py file:
from flask import Flask
from decouple import config
from flask_restx import Api

app = Flask(__name__)
app.config.from_object(config("APP_SETTINGS"))
api = Api(
    app,
    version='1.0',
    title='Horoscope API',
    description='Get horoscope data easily using the below APIs',
    license='MIT',
    contact='Ashutosh Krishna',
    contact_url='https://ashutoshkrris.tk',
    contact_email='contact@ashutoshkrris.tk',
    doc='/',
    prefix='/api/v1'
)

from core import routes            # Add this line

We're now left with just one file, main.py, which will help us run the Flask server. Create it in the root directory of the project:
from core import app

if __name__ == '__main__':
    app.run()

Once you run this file using the python main.py command, you'll see a similar output:

Now, we are ready to scrape the data from the Horoscope website.
How to Scrape the Data from Horoscope.com
If you open Horoscope.com and choose your zodiac sign, the horoscope data for your zodiac sign for today will be shown.

Source: Horoscope.com
In the above image, you can see you can view the horoscope for yesterday, tomorrow, weekly, monthly or even a custom date. We're going to use all of these.
But first, look at the URL of the current page. It is something like: https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=10 .
If you look closely, the URL has two variable parts: sign and today. The value of sign is set according to the zodiac sign, and today can be replaced with yesterday or tomorrow.
The dictionary below can help us with the zodiac signs:
ZODIAC_SIGNS = {
    "Aries": 1,
    "Taurus": 2,
    "Gemini": 3,
    "Cancer": 4,
    "Leo": 5,
    "Virgo": 6,
    "Libra": 7,
    "Scorpio": 8,
    "Sagittarius": 9,
    "Capricorn": 10,
    "Aquarius": 11,
    "Pisces": 12
}

This means that if your zodiac sign is Capricorn, the value of sign in the URL will be 10. 
Next, if we wish to get the horoscope data for a custom date, the URL https://www.horoscope.com/us/horoscopes/general/horoscope-archive.aspx?sign=10&laDate=20211213 will help us. 
It has the same sign variable, but it has another variable laDate which takes the date in YYYYMMDD format. 
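The two URL patterns described above can be sketched as a pair of small helpers. The names BASE, daily_url, and archive_url are assumptions for illustration. The sketch only builds the strings and makes no request.

```python
from urllib.parse import urlencode

BASE = "https://www.horoscope.com/us/horoscopes/general"


def daily_url(sign: int, day: str) -> str:
    """URL for a relative day: 'today', 'yesterday' or 'tomorrow'."""
    return f"{BASE}/horoscope-general-daily-{day}.aspx?{urlencode({'sign': sign})}"


def archive_url(sign: int, yyyymmdd: str) -> str:
    """URL for a custom date, given in YYYYMMDD format."""
    query = urlencode({"sign": sign, "laDate": yyyymmdd})
    return f"{BASE}/horoscope-archive.aspx?{query}"


print(daily_url(10, "today"))
print(archive_url(10, "20211213"))
```

Letting urlencode assemble the query string keeps the parameters escaped properly if the values ever stop being plain digits.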
Now, we're ready to create different functions to fetch horoscope data. Create a utils.py file and follow along.
How to Get a Horoscope for the Day
import requests
from bs4 import BeautifulSoup


def get_horoscope_by_day(zodiac_sign: int, day: str):
    if "-" not in day:
        res = requests.get(f"https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-{day}.aspx?sign={zodiac_sign}")
    else:
        day = day.replace("-", "")
        res = requests.get(f"https://www.horoscope.com/us/horoscopes/general/horoscope-archive.aspx?sign={zodiac_sign}&laDate={day}")
    soup = BeautifulSoup(res.content, 'html.parser')
    data = soup.find('div', attrs={'class': 'main-horoscope'})
    return data.p.text

We have created our first function, which accepts two arguments: an integer zodiac_sign and a string day. The day can be today, tomorrow, yesterday, or any custom date before today in the format YYYY-MM-DD.
If the day is not a custom date, it won't have a hyphen (-) in it, so we check for that.
If there is no hyphen, we make a GET request to https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-{day}.aspx?sign={zodiac_sign}. Otherwise, we first change the date from YYYY-MM-DD to YYYYMMDD format.
Then we make a GET request to https://www.horoscope.com/us/horoscopes/general/horoscope-archive.aspx?sign={zodiac_sign}&laDate={day}.
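The function above trusts that any string containing a hyphen is a well-formed date. A slightly stricter variant, sketched here with the standard library only, parses the date before reformatting it, so a malformed value fails early with a ValueError. The names RELATIVE_DAYS and normalize_day are assumptions for illustration.

```python
from datetime import datetime

RELATIVE_DAYS = {"today", "tomorrow", "yesterday"}


def normalize_day(day: str) -> str:
    """Return a relative keyword unchanged, or a YYYY-MM-DD date as YYYYMMDD."""
    if day in RELATIVE_DAYS:
        return day
    # strptime raises ValueError for anything that is not a real calendar date
    return datetime.strptime(day, "%Y-%m-%d").strftime("%Y%m%d")


print(normalize_day("today"))
print(normalize_day("2021-12-13"))
```

Validating here means a bad date never reaches the website, which saves a request that would otherwise return a page without the expected horoscope text.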
After that, we parse the HTML from the response content using BeautifulSoup. Now we need to get the horoscope text from this HTML. If you inspect the code of any of these pages, you'll find this:

The horoscope text is contained in a div with the class main-horoscope. So we use the soup.find() function to locate that div, then return the text of its first paragraph.
How to Get a Horoscope for the Week
def get_horoscope_by_week(zodiac_sign: int):
    res = requests.get(f"https://www.horoscope.com/us/horoscopes/general/horoscope-general-weekly.aspx?sign={zodiac_sign}")
    soup = BeautifulSoup(res.content, 'html.parser')
    data = soup.find('div', attrs={'class': 'main-horoscope'})
    return data.p.text

The above function is quite similar to the previous one. We have just changed the URL to https://www.horoscope.com/us/horoscopes/general/horoscope-general-weekly.aspx?sign={zodiac_sign}.
How to Get a Horoscope for the Month
def get_horoscope_by_month(zodiac_sign: int):
    res = requests.get(f"https://www.horoscope.com/us/horoscopes/general/horoscope-general-monthly.aspx?sign={zodiac_sign}")
    soup = BeautifulSoup(res.content, 'html.parser')
    data = soup.find('div', attrs={'class': 'main-horoscope'})
    return data.p.text

This function is also similar to the other two, except for the URL, which has now been changed to https://www.horoscope.com/us/horoscopes/general/horoscope-general-monthly.aspx?sign={zodiac_sign}.
How to Create API Routes
We'll be using Flask-RESTX to create our API routes. The API routes will look like these:

For daily or custom dates: /api/v1/get-horoscope/daily?day=today&sign=capricorn or /api/v1/get-horoscope/daily?day=2022-12-14&sign=capricorn
For weekly: /api/v1/get-horoscope/weekly?sign=capricorn
For monthly: /api/v1/get-horoscope/monthly?sign=capricorn

We have two query parameters in the URLs: day and sign. The day parameter can take values like today, yesterday, or custom dates like 2022-12-14. The sign parameter takes the zodiac sign name, which can be in uppercase or lowercase.
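The case-insensitive handling of sign can be sketched as a lookup that normalizes the name before consulting the ZODIAC_SIGNS dictionary from earlier. The helper name sign_number is an assumption for illustration. An unknown name raises a KeyError, which the caller can turn into an error response.

```python
ZODIAC_SIGNS = {
    "Aries": 1, "Taurus": 2, "Gemini": 3, "Cancer": 4,
    "Leo": 5, "Virgo": 6, "Libra": 7, "Scorpio": 8,
    "Sagittarius": 9, "Capricorn": 10, "Aquarius": 11, "Pisces": 12,
}


def sign_number(name: str) -> int:
    """Look up a sign number regardless of the caller's letter case."""
    # capitalize() upper-cases the first letter and lower-cases the rest,
    # which matches how the dictionary keys are written.
    return ZODIAC_SIGNS[name.strip().capitalize()]


print(sign_number("capricorn"), sign_number("CAPRICORN"))
```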
To parse the query parameters from the URL, Flask-RESTX has built-in support for request data validation using a library similar to argparse called reqparse. To add arguments in the URL, we'll use the add_argument method of the RequestParser class.
parser = reqparse.RequestParser()
parser.add_argument('sign', type=str, required=True)

The type parameter sets the type the value is converted to (a string here), and required=True makes the query parameter mandatory.
Now, we need another query parameter day. But this parameter will be used only in the daily horoscope URL. 
Instead of rewriting arguments we can write a parent parser containing all the shared arguments and then extend the parser with copy().
parser_copy = parser.copy()
parser_copy.add_argument('day', type=str, required=True)

The parser_copy will not only contain day, but also sign. That is what we'll require for the daily horoscope.
The main building blocks provided by Flask-RESTX are resources. Resources are built on top of Flask pluggable views, giving you easy access to multiple HTTP methods just by defining methods on your resource. 
Let's create the DailyHoroscopeAPI class that inherits the Resource class from flask_restx.
@ns.route('/get-horoscope/daily')
class DailyHoroscopeAPI(Resource):
    '''Shows daily horoscope of zodiac signs'''
    @ns.doc(parser=parser_copy)
    def get(self):
        args = parser_copy.parse_args()
        day = args.get('day')
        zodiac_sign = args.get('sign')
        try:
            zodiac_num = ZODIAC_SIGNS[zodiac_sign.capitalize()]
            if "-" in day:
                datetime.strptime(day, '%Y-%m-%d')
            horoscope_data = get_horoscope_by_day(zodiac_num, day)
            return jsonify(success=True, data=horoscope_data, status=200)
        except KeyError:
            raise NotFound('No such zodiac sign exists')
        except AttributeError:
            raise BadRequest(
                'Something went wrong, please check the URL and the arguments.')
        except ValueError:
            raise BadRequest('Please enter day in correct format: YYYY-MM-DD')

The @ns.route() decorator sets the API route. Inside the DailyHoroscopeAPI class, we have the get method that will handle the GET requests. The @ns.doc() decorator will help us add the query parameters on the URL. 
To get the values of query parameters, we'll use the parse_args() method that will return us a dictionary like this:
{'sign': 'capricorn', 'day': '2022-12-14'}

We can then get the values using the keys day and sign.
As defined in the beginning, we'll have a ZODIAC_SIGNS dictionary. We use a try-except block to handle the request. If the zodiac sign is not in the dictionary, a KeyError Exception is raised. In that case, we respond with a NotFound error (Error 404). 
Also, if the day parameter has a hyphen in it, we try to match the date format with YYYY-MM-DD. If it's not in that format, we raise a BadRequest error (Error 400). If the day doesn't contain a hyphen, we directly call the get_horoscope_by_day() method with the sign and day arguments. 
If some gibberish is passed as the parameter value, an AttributeError is raised. In that case, we raise a BadRequest error.
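The validation steps described above can be tried on their own, outside Flask. The sketch below mirrors the lookup and the date check; `validate_request` is a hypothetical helper, and the shortened ZODIAC_SIGNS dictionary with its numbers is an assumption about how the real one is laid out.

```python
from datetime import datetime

# Assumed shape of the ZODIAC_SIGNS dictionary (shortened); the numbers are illustrative.
ZODIAC_SIGNS = {"Aries": 1, "Taurus": 2, "Capricorn": 10}


def validate_request(sign: str, day: str) -> tuple:
    """Return (zodiac_num, day), or raise KeyError / ValueError like the route does."""
    zodiac_num = ZODIAC_SIGNS[sign.capitalize()]  # KeyError for unknown signs
    if "-" in day:
        datetime.strptime(day, "%Y-%m-%d")  # ValueError if not YYYY-MM-DD
    return zodiac_num, day


if __name__ == "__main__":
    print(validate_request("capricorn", "today"))
    print(validate_request("CAPRICORN", "2022-12-14"))
    try:
        validate_request("capricorn", "14-12-2022")
    except ValueError:
        print("Please enter day in correct format: YYYY-MM-DD")
```

In the route, the KeyError maps to a 404 response and the ValueError to a 400.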
The other two routes are also quite similar to the above one. The difference is, we don't need a day parameter here. So, instead of using parser_copy, we'll use parser here.
@ns.route('/get-horoscope/weekly')
class WeeklyHoroscopeAPI(Resource):
    '''Shows weekly horoscope of zodiac signs'''
    @ns.doc(parser=parser)
    def get(self):
        args = parser.parse_args()
        zodiac_sign = args.get('sign')
        try:
            zodiac_num = ZODIAC_SIGNS[zodiac_sign.capitalize()]
            horoscope_data = get_horoscope_by_week(zodiac_num)
            return jsonify(success=True, data=horoscope_data, status=200)
        except KeyError:
            raise NotFound('No such zodiac sign exists')
        except AttributeError:
            raise BadRequest('Something went wrong, please check the URL and the arguments.')


@ns.route('/get-horoscope/monthly')
class MonthlyHoroscopeAPI(Resource):
    '''Shows monthly horoscope of zodiac signs'''
    @ns.doc(parser=parser)
    def get(self):
        args = parser.parse_args()
        zodiac_sign = args.get('sign')
        try:
            zodiac_num = ZODIAC_SIGNS[zodiac_sign.capitalize()]
            horoscope_data = get_horoscope_by_month(zodiac_num)
            return jsonify(success=True, data=horoscope_data, status=200)
        except KeyError:
            raise NotFound('No such zodiac sign exists')
        except AttributeError:
            raise BadRequest('Something went wrong, please check the URL and the arguments.')

Now our routes are done. To test the APIs, you can use the Swagger documentation available on the / route, or you can use Postman. Let's run the server and test it.

        
You can also deploy the project on a public server so that other developers can access and use the API too.
Wrapping up
In this tutorial, we learned how to scrape data from a website using requests and Beautiful Soup. Then we created an API using Flask and Flask-RESTX. 
If you wish to learn how to interact with APIs using Python, check out this guide.
I hope you enjoyed it – and thanks for reading!
Code for the tutorial: https://github.com/ashutoshkrris/Horoscope-API 
 

 Python Web Scraping Tutorial – How to Scrape Data From Any Website with Python 
freeCodeCamp — Tue, 10 Aug 2021 17:42:52 +0000
 By Sorin-Gabriel Marica
Web scraping is the process of extracting specific data from the internet automatically. It has many use cases, like getting data for a machine learning project, creating a price comparison tool, or any other innovative idea that requires an immense amount of data.
While you can theoretically do data extraction manually, the sheer volume of content on the internet makes this approach unrealistic in many cases. So knowing how to build a web scraper can come in handy. 
This article’s purpose is to teach you how to create a web scraper in Python. You will learn how to inspect a website to prepare for scraping, extract specific data using BeautifulSoup, wait for JavaScript rendering using Selenium, and save everything in a new JSON or CSV file.
But first, I should warn you about the legality of web scraping. While the act of scraping is legal, the data you may extract can be illegal to use. Make sure that you're not messing with any:

Copyrighted content – since it's someone's intellectual property, it's protected by law and you can't just reuse it.
Personal data – if the information you gather can be used to identify a person, then it's considered personal data and for EU citizens, it's protected under the GDPR. Unless you have a lawful reason to store that data, it's better to just skip it altogether.

Generally speaking, you should always read a website's terms and conditions before scraping to make sure that you're not going against their policies. If you're ever unsure how to proceed, contact the site owner and ask for consent. 
What Will You Need for Your Scraper?
To start building your own web scraper, you will first need to have Python installed on your machine. Ubuntu 20.04 and other versions of Linux come with Python 3 pre-installed. 
To check if you already have Python installed on your device, run the following command:
python3 -V
If you have Python installed, you should receive an output like this:
Python 3.8.2
Also, for our web scraper, we will use the Python packages BeautifulSoup (for selecting specific data) and Selenium (for rendering dynamically loaded content). To install them, just run these commands:
pip3 install beautifulsoup4
and
pip3 install selenium
The final step is to make sure you have Google Chrome and ChromeDriver installed on your machine. These will be necessary if we want to use Selenium to scrape dynamically loaded content.
How to Inspect the Page
Now that you have everything installed, it’s time to start our scraping project in earnest. 
You should choose the website you want to scrape based on your needs. Keep in mind that each website structures its content differently, so you’ll need to adjust what you learn here when you start scraping on your own. Each website will require minor changes to the code.
For this article, I decided to scrape information about the first ten movies from the top 250 movies list from IMDb: https://www.imdb.com/chart/top/. 
First, we will get the titles, then we will dive in further by extracting information from each movie’s page. Some of the data will require JavaScript rendering.
To start understanding the content’s structure, you should right-click on the first title from the list and then choose “Inspect Element”.

By pressing CTRL+F and searching in the HTML code structure, you will see that there is only one <table> tag on the page. This is useful as it gives us information about how we can access the data.
An HTML selector that will give us all of the titles from the page is table tbody tr td.titleColumn a. That’s because all titles are in an anchor inside a table cell with the class “titleColumn”. 
Using this CSS selector and getting the innerText of each anchor will give us the titles that we need. You can try this in the browser console of the inspector window you just opened, using this JavaScript line:
document.querySelectorAll("table tbody tr td.titleColumn a")[0].innerText
You will see something like this:

Now that we have this selector, we can start writing our Python code and extracting the information we need.
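To see what the selector is meant to match without requesting the IMDb page, here is a sketch that runs against a small hand-written HTML fragment. It uses html.parser from the standard library instead of BeautifulSoup purely so that it has no dependencies. The fragment, the ratingColumn class, and the `TitleCollector` class are made up for illustration.

```python
from html.parser import HTMLParser

# Made-up fragment that mimics the structure described above.
SAMPLE_HTML = """
<table><tbody>
  <tr><td class="titleColumn"><a href="/title/tt0111161/">The Shawshank Redemption</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0068646/">The Godfather</a></td></tr>
  <tr><td class="ratingColumn"><a href="/ignored/">9.2</a></td></tr>
</tbody></table>
"""


class TitleCollector(HTMLParser):
    """Collects the text of anchors that sit inside td.titleColumn cells."""

    def __init__(self):
        super().__init__()
        self.in_title_cell = False
        self.in_anchor = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td":
            # Only cells with the titleColumn class count
            self.in_title_cell = "titleColumn" in classes
        elif tag == "a" and self.in_title_cell:
            self.in_anchor = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False
        elif tag == "td":
            self.in_title_cell = False

    def handle_data(self, data):
        if self.in_anchor:
            self.titles[-1] += data  # Accumulate the anchor's text


collector = TitleCollector()
collector.feed(SAMPLE_HTML)
print(collector.titles)
```

The anchor in the ratingColumn cell is skipped, which is the same filtering the td.titleColumn part of the selector performs.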
How to Use BeautifulSoup to Extract Statically Loaded Content
The movie titles from our list are static content. That’s because if you look into the page source (CTRL+U on the page or right-click and then choose View Page Source), you will see that the titles are already there.
Static content is usually easier to scrape as it doesn’t require JavaScript rendering. To extract the first ten titles on the list, we will use BeautifulSoup to get the content and then print it in the output of our scraper.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.imdb.com/chart/top/') # Getting page HTML through request
soup = BeautifulSoup(page.content, 'html.parser') # Parsing content using beautifulsoup

links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    print(anchor.text) # Display the innerText of each anchor

The code above uses the selector we saw in the first step to extract the movie title anchors from the page. It then loops through the first ten and displays the innerText of each.
The output should look like this:

How to Extract Dynamically Loaded Content
As technology advanced, websites started to load their content dynamically. This improves the page’s performance and the user's experience, but it also puts an extra barrier in front of scrapers: the HTML retrieved from a simple request will not contain the dynamic content.
Fortunately, with Selenium, we can simulate a request in the browser and wait for the dynamic content to be displayed.
How to Use Selenium for Requests
You will need to know the location of your chromedriver. The following code is identical to the one presented in the second step, but this time we are using Selenium to make the request. We will still parse the page’s content using BeautifulSoup, as we did before.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

option = webdriver.ChromeOptions()
# I use the following options because my machine runs Windows Subsystem for Linux.
# Of the three, I recommend using at least the headless option
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
# (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service('YOUR-PATH-TO-CHROMEDRIVER'), options=option)

driver.get('https://www.imdb.com/chart/top/') # Getting page HTML through request
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parsing content using beautifulsoup. Notice driver.page_source instead of page.content

links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    print(anchor.text) # Display the innerText of each anchor

Don’t forget to replace “YOUR-PATH-TO-CHROMEDRIVER” with the location where you extracted the chromedriver. Also, you should notice that instead of page.content, when we are creating the BeautifulSoup object, we are now using driver.page_source, which provides the HTML content of the page.
How to Extract Statically Loaded Content Using Selenium
Using the code from above, we can now access each movie page by calling the click method on each of the anchors.
# Requires: from selenium.webdriver.common.by import By
first_link = driver.find_elements(By.CSS_SELECTOR, 'table tbody tr td.titleColumn a')[0]
first_link.click()

This will simulate a click on the first movie’s link. However, in this case, I recommend that you continue using driver.get instead. This is because you will no longer be able to use the click() method after you go on a different page since the new page doesn't have links to the other nine movies.
As a result, after clicking on the first title from the list, you’d need to go back to the first page, then click on the second, and so on. This is a waste of performance and time. Instead, we will just use the extracted links and access them one by one.
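Because the loop visits each extracted link with driver.get, the relative href values first need to become absolute URLs. One way to do that is urljoin from the standard library. This is a sketch that assumes the hrefs are site-relative paths such as /title/tt0111161/; the list below is illustrative rather than scraped.

```python
from urllib.parse import urljoin

BASE = "https://www.imdb.com/"

# Example href values as they might appear on the chart page (illustrative).
hrefs = ["/title/tt0111161/", "/title/tt0068646/"]

# urljoin handles the slash between the base and the path for us
movie_urls = [urljoin(BASE, href) for href in hrefs]
print(movie_urls)
```

Compared with plain string concatenation, urljoin avoids ending up with a doubled slash when the base ends with / and the href starts with one.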
For “The Shawshank Redemption”, the movie page will be https://www.imdb.com/title/tt0111161/. We will extract the movie’s year and duration from the page, but this time we will use Selenium’s functions instead of BeautifulSoup as an example. In practice, you can use either one, so pick your favorite.
To retrieve the movie’s year and duration, you should repeat the first step we went through here on the movie’s page. 
You will notice that you can find all of the information in the first element with the class ipc-inline-list (the ".ipc-inline-list" selector) and that all of the elements of the list contain the attribute role with the value presentation (the [role='presentation'] selector).
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

option = webdriver.ChromeOptions()
# I use the following options because my machine runs Windows Subsystem for Linux.
# Of the three, I recommend using at least the headless option
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
# (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service('YOUR-PATH-TO-CHROMEDRIVER'), options=option)

driver.get('https://www.imdb.com/chart/top/') # Loading the page in the browser
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parsing content using beautifulsoup

totalScrapedInfo = [] # In this list we will save all the information we scrape
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    driver.get('https://www.imdb.com' + anchor['href']) # Access the movie’s page (the href already starts with a slash)
    infolist = driver.find_elements(By.CSS_SELECTOR, '.ipc-inline-list')[0] # Find the first element with class ‘ipc-inline-list’
    informations = infolist.find_elements(By.CSS_SELECTOR, "[role='presentation']") # Find all elements with role=’presentation’ from the first element with class ‘ipc-inline-list’
    scrapedInfo = {
        "title": anchor.text,
        "year": informations[0].text,
        "duration": informations[2].text,
    } # Save all the scraped information in a dictionary
    totalScrapedInfo.append(scrapedInfo) # Append the dictionary to the totalScrapedInformation list

print(totalScrapedInfo) # Display the list with all the information we scraped

How to Extract Dynamically Loaded Content Using Selenium
The next big step in web scraping is extracting content that is loaded dynamically. You can find such content on each of the movie’s pages (such as https://www.imdb.com/title/tt0111161/) in the Editorial Lists section. 
If you inspect the page, you'll see that the section is an element with the attribute data-testid set to firstListCardGroup-editorial. But if you look in the page source, you will not find this attribute value anywhere. That’s because the Editorial Lists section is loaded by IMDb dynamically.
In the following example, we will scrape the editorial list of each movie and add it to our current results of the total scraped information. 
To do that, we will import a few more packages that make it possible to wait for our dynamic content to load.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
# I use the following options because my machine runs Windows Subsystem for Linux.
# Of the three, I recommend using at least the headless option
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
# (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service('YOUR-PATH-TO-CHROMEDRIVER'), options=option)

driver.get('https://www.imdb.com/chart/top/') # Loading the page in the browser
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parsing content using beautifulsoup

totalScrapedInfo = [] # In this list we will save all the information we scrape
links = soup.select("table tbody tr td.titleColumn a") # Selecting all of the anchors with titles
first10 = links[:10] # Keep only the first 10 anchors
for anchor in first10:
    driver.get('https://www.imdb.com' + anchor['href']) # Access the movie’s page (the href already starts with a slash)
    infolist = driver.find_elements(By.CSS_SELECTOR, '.ipc-inline-list')[0] # Find the first element with class ‘ipc-inline-list’
    informations = infolist.find_elements(By.CSS_SELECTOR, "[role='presentation']") # Find all elements with role=’presentation’ from the first element with class ‘ipc-inline-list’
    scrapedInfo = {
        "title": anchor.text,
        "year": informations[0].text,
        "duration": informations[2].text,
    } # Save all the scraped information in a dictionary
    WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial']")))  # Wait up to 5 seconds for the element with the attribute data-testid set as `firstListCardGroup-editorial`
    listElements = driver.find_elements(By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial'] .listName") # Extracting the editorial lists elements
    listNames = [] # Creating an empty list and then appending only the elements texts
    for el in listElements:
        listNames.append(el.text)
    scrapedInfo['editorial-list'] = listNames # Adding the editorial list names to our scrapedInfo dictionary
    totalScrapedInfo.append(scrapedInfo) # Append the dictionary to the totalScrapedInformation list

print(totalScrapedInfo) # Display the list with all the information we scraped

For the previous example, you should get the following output:

How to Save the Scraped Content
Now that we have all the data we want, we can save it as a .json or a .csv file for easier readability. 
To do that, we will just use Python's built-in json and csv modules and write our content to new files:
import csv
import json

...

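# Write the list of dictionaries as JSON; the with block closes the file for us
with open('movies.json', mode='w', encoding='utf-8') as file:
    json.dump(totalScrapedInfo, file)

# newline='' stops the csv module from adding blank rows on Windows
with open('movies.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    for movie in totalScrapedInfo:
        writer.writerow(movie.values())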
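If you also want a header row in the CSV and a readable value for the list-valued field, csv.DictWriter is one option. The sketch below uses made-up records shaped like the scrapedInfo dictionaries and writes to a temporary folder. The sample values and the "; " separator are assumptions for illustration.

```python
import csv
import json
import os
import tempfile

# Illustrative records shaped like the scrapedInfo dictionaries above.
movies = [
    {"title": "Movie A", "year": "1994", "duration": "2h 22m", "editorial-list": ["List 1", "List 2"]},
    {"title": "Movie B", "year": "1972", "duration": "2h 55m", "editorial-list": []},
]

out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "movies.json")
csv_path = os.path.join(out_dir, "movies.csv")

# JSON keeps the nested list as it is
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(movies, f, ensure_ascii=False, indent=2)

# CSV is flat, so the list is joined into a single cell
fieldnames = ["title", "year", "duration", "editorial-list"]
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for movie in movies:
        row = dict(movie)
        row["editorial-list"] = "; ".join(movie["editorial-list"])
        writer.writerow(row)

# Read the CSV back to check what was written
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["title"], "|", rows[0]["editorial-list"])
```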

Scraping Tips and Tricks
While our guide so far is already advanced enough to take care of JavaScript rendering scenarios, there are still many things to explore in Selenium. 
In this section, I will share some tips and tricks that may come in handy.
1. Time your requests
If you spam a server with hundreds of requests in a short time, it’s very probable that at some point, a captcha code will appear, or your IP might even get blocked. Unfortunately, there is no workaround in Python to avoid that. 
Therefore, you should put some timeout breaks between each request so that the traffic will look more natural.
import time
import requests

page = requests.get('https://www.imdb.com/chart/top/') # Getting page HTML through request
time.sleep(30) # Wait 30 seconds
page = requests.get('https://www.imdb.com/') # Getting page HTML through request
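When you have a whole list of URLs, the same idea can be wrapped in a loop that pauses between calls. This is a sketch under a few assumptions: `fetch_all` is a hypothetical helper, the default delays are arbitrary, and the random extra wait is my own addition so that requests are not evenly spaced. The fetch argument would be requests.get in a real scraper; a stand-in is used here so the example runs offline.

```python
import random
import time


def fetch_all(urls, fetch, base_delay=2.0, jitter=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a randomized interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait base_delay plus a random extra so requests are not evenly spaced
            sleep(base_delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    pages = fetch_all(
        ["https://example.com/a", "https://example.com/b"],
        fetch=lambda url: f"fetched {url}",  # Stand-in for requests.get
        base_delay=0.1,
        jitter=0.1,
    )
    print(pages)
```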

2. Error handling
Since websites are dynamic and they can change structure at any moment, error handling might come in handy if you use the same web scraper frequently.
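from selenium.common.exceptions import TimeoutException

for attempt in range(3):  # Try up to three times
    try:
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, "your selector")))
        break  # The element was found, so stop retrying
    except TimeoutException:
        # If the loading took too long, print message and try again
        print("Loading took too much time!")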

The try/except syntax can be useful when you’re waiting for an element, extracting it, or even when you’re just making the request.
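The same pattern can be pulled into a small reusable function that wraps any step, whether that is a wait, an extraction, or a plain request. This is a minimal sketch: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```python
import time


def retry(action, attempts=3, delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Run action() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            if attempt == attempts:
                raise  # Out of attempts, let the caller see the error
            print(f"Attempt {attempt} failed ({error}); retrying...")
            sleep(delay)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to imitate a slow page
        calls["count"] += 1
        if calls["count"] < 3:
            raise TimeoutError("page not ready")
        return "loaded"

    print(retry(flaky, attempts=3, delay=0))
```

With Selenium, you could pass exceptions=(TimeoutException,) so that only timeouts are retried and other errors surface immediately.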
3. Take Screenshots
If you need to obtain a screenshot of the web page you are scraping at any moment, you can use:
driver.save_screenshot('screenshot-file-name.png')

This can help debug when you’re working with dynamically loaded content.
4. Read the documentation
Last but not least, don’t forget to read the Selenium documentation. It covers how to do most of the actions you can do in a browser. 
Using Selenium, you can fill out forms, press buttons, answer popup messages, and do many other cool things. 
If you’re facing a new problem, their documentation can be your best friend.
Final Thoughts
This article’s purpose is to give you an advanced introduction to web scraping using Python with Selenium and BeautifulSoup. While there are still many features from both technologies to explore, you now have a solid base on how to start scraping.
Sometimes web scraping can be very difficult, as websites start to put more and more obstacles in the developer’s way. Some of these obstacles can be Captcha codes, IP blocks, or dynamic content. Overcoming them just with Python and Selenium might be difficult or even impossible. 
So, I’ll give you an alternative as well. Try using a web scraping API that solves all those challenges for you. It also uses rotating proxies so that you don’t have to worry about adding timeouts between requests. Just remember to always check if the data you want can be lawfully extracted and used.
