Web scraping is the process of extracting data from websites.
Before attempting to scrape a website, you should make sure that the provider allows it in their terms of service. You should also check to see whether you could use an API instead.
Scraping at scale can put a server under a lot of stress, which can result in a denial of service. And you don't want that.
Who should read this?
This article is for advanced readers. It will assume that you are already familiar with the Python programming language.
At the very minimum you should understand list comprehensions, context managers, and functions. You should also know how to set up a virtual environment.
We'll run the code on your local machine to explore some websites. With some tweaks you could make it run on a server as well.
What you will learn in this article
By the end of this article, you will know how to download a webpage, parse it for interesting information, and convert it into a usable format for further processing. This is also known as ETL (extract, transform, load).
Before I can start, I want to make sure we're ready to go. Please set up a virtual environment and install the following packages into it:
- beautifulsoup4 (version 4.9.0 at time of writing)
- requests (version 2.23.0 at time of writing)
- wordcloud (version 1.7.0 at time of writing, optional)
- selenium (version 3.141.0 at time of writing, optional)
You can find the code for this project in this git repository on GitHub.
For this example, we are going to scrape the Basic Law for the Federal Republic of Germany. (Don't worry, I checked their Terms of Service. They offer an XML version for machine processing, but this page serves as an example of processing HTML. So it should be fine.)
Step 1: Download the source
First things first: I create a file urls.txt holding all the URLs I want to download.
Next, I write a bit of Python code in a file called scraper.py to download the HTML of these files.
In a real scenario, storing every page as a separate file would be too expensive and you'd use a database instead. To keep things simple, I'll download the files into the same directory as the script and use the page's name as the filename.
By downloading the files, I can process them locally as much as I want without being dependent on a server. Try to be a good web citizen, okay?
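The download step described above could be sketched roughly as follows. This is a minimal sketch, not the original scraper.py: the helper names (filename_for, download, download_all), the ten-second timeout, and the assumption that urls.txt holds one URL per line are all mine.

```python
from pathlib import Path
from urllib.parse import urlparse

import requests


def filename_for(url):
    """Use the last path segment of the URL as the local filename."""
    return Path(urlparse(url).path).name


def download(url, target_dir=Path(".")):
    """Fetch one URL and store the raw HTML in target_dir."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    target = Path(target_dir) / filename_for(url)
    # Write the raw bytes so the page's original encoding is preserved.
    target.write_bytes(response.content)
    return target


def download_all(list_path="urls.txt"):
    """Download every URL in the list (assumed: one URL per line)."""
    with open(list_path, encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    return [download(url) for url in urls]
```

Calling download_all() from the directory that contains urls.txt would leave one HTML file per URL next to it.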
Step 2: Parse the source
Now that I've downloaded the files, it's time to extract their interesting features. To find them, I open one of the downloaded pages in a web browser and hit Ctrl-U to view its source. Inspecting it shows me the HTML structure.
In my case, I figured I want the text of the law without any markup. The element wrapping it has an id of container. Using BeautifulSoup I can see that looking up that element and calling get_text on it will do what I want.
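As a rough illustration of this step, the sketch below looks up the element by its id and strips the markup with get_text. I am assuming BeautifulSoup's find(id=...) for the lookup, and the sample HTML is a simplified stand-in for the real page, not a copy of it.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return the plain text inside the element with id="container"."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="container")
    if container is None:
        raise ValueError('no element with id="container" found')
    # Strip each text fragment and join the fragments with single spaces.
    return container.get_text(separator=" ", strip=True)


# Simplified stand-in for a downloaded page (assumption, not the real markup).
sample = (
    '<html><body><div id="container">'
    "<h1>Art 1</h1><p>Die Würde des Menschen ist unantastbar.</p>"
    "</div></body></html>"
)
print(extract_text(sample))
```

For a downloaded file, passing its bytes (for example Path("art_1.html").read_bytes()) lets BeautifulSoup detect the encoding on its own.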
Since I have a second step now, I'm going to refactor the code a bit by putting it into functions and adding a minimal CLI.
Now I can run the code in three ways:
- Without any arguments to run everything (that is, download all URLs and extract them, then save to disk) via: python scraper.py
- With an argument of download and a URL to download: python scraper.py download https://www.gesetze-im-internet.de/gg/art_1.html. This will not process the file.
- With an argument of parse and a filepath to parse: python scraper.py parse art_1.html. This will skip the download step.
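The three invocations above could be wired up with a very small dispatcher. This is only a sketch of the idea: parse_args and main are names I made up, and the print calls are placeholders for the real download and parse functions.

```python
import sys


def parse_args(argv):
    """Map command-line arguments to an (action, target) pair."""
    if not argv:
        return ("all", None)
    if len(argv) == 2 and argv[0] in ("download", "parse"):
        return (argv[0], argv[1])
    raise SystemExit("usage: scraper.py [download URL | parse FILEPATH]")


def main(argv):
    action, target = parse_args(argv)
    if action == "download":
        print(f"would download {target}")  # placeholder for the download step
    elif action == "parse":
        print(f"would parse {target}")  # placeholder for the parse step
    else:
        print("would download and parse every URL in urls.txt")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the argument handling in its own function makes it easy to check without touching the network or the disk.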
With that, there's one last thing missing.
Step 3: Format the source for further processing
Let's say I want to generate a word cloud for each article. This can be a quick way to get an idea of what a text is about. For this, install the package wordcloud and update scraper.py to generate the images.
What changed? For one, I downloaded a list of German stopwords from GitHub. This way, I can eliminate the most common words from the downloaded law text.
Then I create a WordCloud instance with the list of stopwords I downloaded and the text of the law. The text is turned into an image with the same basename as the source file.
After the first run, I discover that the list of stopwords is incomplete. So I add additional words I want to exclude from the resulting image.
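A sketch of how this step might look is shown below. The function names and the .png suffix are my assumptions; the original only says the image keeps the same basename. The wordcloud import sits inside the function because the package is optional in this article.

```python
from pathlib import Path


def build_stopwords(base_words, extra_words=()):
    """Merge the downloaded stopword list with hand-picked additions."""
    merged = (*base_words, *extra_words)
    return {word.strip().lower() for word in merged if word.strip()}


def image_path(source):
    """Map art_1.html to art_1.png (same basename, image suffix)."""
    return Path(source).with_suffix(".png")


def save_wordcloud(text, stopwords, source):
    """Render text as a word cloud image next to the source file."""
    # Imported lazily because wordcloud is an optional dependency here.
    from wordcloud import WordCloud

    cloud = WordCloud(stopwords=stopwords).generate(text)
    target = image_path(source)
    cloud.to_file(str(target))
    return target
```

The extra words are whatever still dominates the image after the first run; they are passed as the second argument to build_stopwords.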
With that, the main part of web scraping is complete.
Bonus: What about SPAs?
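A single-page application (SPA) renders its content with JavaScript, so downloading the HTML alone is not enough. Instead, we'll drive a real browser with Selenium. Make sure to install a driver for your browser as well. Download the .tar.gz archive and unpack it in the bin folder of your virtual environment so it will be found by Selenium. That is the directory where you can find the activate script (on GNU/Linux systems).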
Since the code will be slower, I create a new file called crawler.py for it.
In crawler.py, Python opens a Firefox instance, browses to the website, and looks for an <article> element. It copies the element's text into a dictionary, which is read out in the transform step and then turned into a WordCloud.
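The crawler described above might look roughly like this. It is a sketch under assumptions, not the original crawler.py: the function names, the ten-second implicit wait, and the whitespace clean-up in the transform step are mine. The Selenium imports sit inside the function because the package is optional in this article.

```python
def extract(url, wait_seconds=10):
    """Open the page in Firefox and return {url: text of the first <article>}."""
    # Imported lazily because selenium is an optional dependency here.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        # Give the JavaScript-rendered content time to appear.
        driver.implicitly_wait(wait_seconds)
        driver.get(url)
        article = driver.find_element(By.TAG_NAME, "article")
        return {url: article.text}
    finally:
        driver.quit()  # always close the browser, even on errors


def transform(pages):
    """Collapse runs of whitespace in every extracted text."""
    return {url: " ".join(text.split()) for url, text in pages.items()}
```

The dictionary returned by transform can then be handed to the word cloud step, one entry per page.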
Thanks for reading this far! Let's summarise what we've learned now:
- How to scrape a website with Python's requests package.
- How to translate it into a meaningful structure using beautifulsoup.
- How to further process that structure into something you can work with.
I'm not the first one who wrote about Web Scraping here on freeCodeCamp. Yasoob Khalid and Dave Gray also did so in the past: