For this example, we are going to scrape the Basic Law for the Federal Republic of Germany. (Don't worry, I checked their Terms of Service. They offer an XML version for machine processing, but this page serves as an example of processing HTML. So it should be fine.)
Step 1: Download the source
First things first: I create a file urls.txt holding all the URLs I want to download:
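It might look something like this. (Only art_1.html appears later in this post; I'm assuming the other articles follow the same URL pattern.)

```
https://www.gesetze-im-internet.de/gg/art_1.html
https://www.gesetze-im-internet.de/gg/art_2.html
https://www.gesetze-im-internet.de/gg/art_3.html
```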
Next, I write a bit of Python code in a file called scraper.py to download the HTML of these files.
In a real scenario, this approach would be too expensive and you'd use a database instead. To keep things simple, I'll download the files into the same directory as the script and use the last part of each URL as the filename.
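A minimal sketch of this first version could look like this (the exact filename logic is my assumption):

```python
# scraper.py -- first version: download each URL and save the HTML locally.
from pathlib import Path

import requests

# urls.txt holds one URL per line.
for url in Path("urls.txt").read_text().split():
    response = requests.get(url)
    response.raise_for_status()
    # Use the last part of the URL (e.g. art_1.html) as the filename.
    Path(url.rsplit("/", 1)[-1]).write_text(response.text, encoding="utf-8")
```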
By downloading the files, I can process them locally as much as I want without being dependent on a server. Try to be a good web citizen, okay?
Step 2: Parse the source
Now that I've downloaded the files, it's time to extract their interesting features. To do that, I open one of the downloaded pages in a web browser and hit Ctrl-U to view its source. Inspecting it shows me the HTML structure.
In my case, I figured out that I want the text of the law without any markup. The element wrapping it has an id of container. Using BeautifulSoup, I can see that a combination of find and get_text will do what I want.
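A quick check in a Python shell might look like this (a sketch, assuming one of the downloaded files is called art_1.html):

```python
from bs4 import BeautifulSoup

# Parse one of the downloaded files.
with open("art_1.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# find() locates the element with id="container"; get_text() strips the markup.
print(soup.find(id="container").get_text())
```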
Since I have a second step now, I'm going to refactor the code a bit by putting it into functions and adding a minimal CLI.
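Here is a sketch of the refactored version; the function names and the filename handling are my assumptions, only the CLI behaviour is taken from the description that follows.

```python
# scraper.py -- refactored: one function per step plus a minimal CLI.
import sys
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def download(url: str) -> Path:
    """Download a URL and save the HTML next to the script."""
    filename = Path(url.rsplit("/", 1)[-1])
    response = requests.get(url)
    response.raise_for_status()
    filename.write_text(response.text, encoding="utf-8")
    return filename


def parse(filename: Path) -> None:
    """Extract the law text from a downloaded file and save it as .txt."""
    soup = BeautifulSoup(filename.read_text(encoding="utf-8"), "html.parser")
    text = soup.find(id="container").get_text()
    filename.with_suffix(".txt").write_text(text, encoding="utf-8")


def main() -> None:
    if len(sys.argv) == 1:
        # No arguments: download and parse every URL in urls.txt.
        for url in Path("urls.txt").read_text().split():
            parse(download(url))
    elif sys.argv[1] == "download":
        download(sys.argv[2])
    elif sys.argv[1] == "parse":
        parse(Path(sys.argv[2]))


if __name__ == "__main__":
    main()
```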
Now I can run the code in three ways:
Without any arguments to run everything (that is, download all URLs and extract them, then save to disk) via: python scraper.py
With an argument of download and a URL to download: python scraper.py download https://www.gesetze-im-internet.de/gg/art_1.html. This will only download the file, not parse it.
With an argument of parse and a filepath to parse: python scraper.py parse art_1.html. This will skip the download step.
With that, there's one last thing missing.
Step 3: Format the source for further processing
Let's say I want to generate a word cloud for each article. A word cloud can be a quick way to get an idea of what a text is about. For this, install the package wordcloud and update the file like this:
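A sketch of the addition; the stopword list URL and the extra words are stand-ins for whatever you end up using:

```python
# Addition to scraper.py -- generate a word cloud per parsed article.
from pathlib import Path

import requests
from wordcloud import WordCloud

# A German stopword list from GitHub; swap in whichever list you prefer.
STOPWORDS_URL = (
    "https://raw.githubusercontent.com/stopwords-iso/stopwords-de"
    "/master/stopwords-de.txt"
)
# Words the downloaded list turned out to miss (placeholders).
EXTRA_STOPWORDS = {"dass", "beim"}


def wordcloud(textfile: Path) -> None:
    stopwords = set(requests.get(STOPWORDS_URL).text.split()) | EXTRA_STOPWORDS
    cloud = WordCloud(stopwords=stopwords).generate(
        textfile.read_text(encoding="utf-8")
    )
    # Save the image with the same basename, e.g. art_1.txt -> art_1.png.
    cloud.to_file(str(textfile.with_suffix(".png")))
```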
What changed? For one, I downloaded a list of German stopwords from GitHub. This way, I can eliminate the most common words from the downloaded law text.
Then I create a WordCloud instance with the list of stopwords I downloaded and generate the cloud from the text of the law. The result is saved as an image with the same basename as the text file.
After the first run, I discover that the list of stopwords is incomplete. So I add additional words I want to exclude from the resulting image.
With that, the main part of web scraping is complete.
Bonus: What about SPAs?
Single-page applications render their content with JavaScript, so the HTML you download with requests won't contain the text you're after. We'll use the browser instead, driven by Selenium. Make sure to also install a driver; for Firefox, that's geckodriver. Download the .tar.gz archive and unpack it into the bin folder of your virtual environment so it will be found by Selenium. That's the directory that also contains the activate script (on GNU/Linux systems).
Since the code will be slower, I create a new file called crawler.py for it. The content looks like this:
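A sketch of what it could contain, following the extract/transform/load split described below (the URL is a placeholder for the SPA you want to scrape):

```python
# crawler.py -- scrape a JavaScript-rendered page with Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from wordcloud import WordCloud

URL = "https://example.com/"  # placeholder: the SPA you want to scrape


def extract(url: str) -> dict:
    """Open Firefox, load the page, and copy the <article> text into a dict."""
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        article = driver.find_element(By.TAG_NAME, "article")
        return {"url": url, "text": article.text}
    finally:
        driver.quit()


def transform(data: dict) -> str:
    """Read the text back out of the dictionary."""
    return data["text"]


def load(text: str) -> None:
    """Turn the text into a word cloud image."""
    WordCloud().generate(text).to_file("article.png")


if __name__ == "__main__":
    load(transform(extract(URL)))
```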
Here, Python opens a Firefox instance, browses to the website, and looks for an <article> element. It copies the element's text into a dictionary, which is read out in the transform step and turned into a word cloud during load.
Thanks for reading this far! Let's summarise what we've learned:
How to scrape a website with Python's requests package.
How to translate it into a meaningful structure using BeautifulSoup.
How to further process that structure into something you can work with.