How to build a Hacker News Frontpage scraper with just 7 lines of R code
Web scraping used to be a difficult task requiring expertise in XML tree parsing and HTTP requests. But with modern scraping libraries like Beautiful Soup (for Python) and rvest (for R), web scraping has become something any beginner can play with.
This post aims to show how simple it is to use R, a very nice programming language for data analysis and data visualization, to scrape the web. The task ahead is very simple: build a web scraper that scrapes the content of one of the most popular pages on the Internet (at least among coders), the Hacker News front page.
Package Installation and Loading
The R package that we are going to use is rvest. It can be installed from CRAN and loaded into R like below:
install.packages('rvest')
library(rvest)
The read_html() function of rvest can be used to read the HTML content of the URL given as its argument.
content <- read_html('https://news.ycombinator.com/')
For read_html() to work without any trouble, please make sure you are not behind an organization firewall. If you are, configure RStudio with a proxy to get through the firewall; otherwise you might face a "connection timed out" error.
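As a rough sketch of the proxy setup, the snippet below sets the standard proxy environment variables for the current R session and wraps the read in tryCatch() so that a failed connection returns NULL instead of stopping the script. The proxy address and the helper name safe_read_html are hypothetical, and this assumes your R installation downloads through libcurl, which normally honors these variables.

```r
library(rvest)

# Hypothetical proxy address; replace it with the one your organization uses.
proxy_url <- 'http://proxy.example.com:8080'
Sys.setenv(http_proxy = proxy_url, https_proxy = proxy_url)

# Hypothetical helper: returns the parsed page, or NULL if reading fails
# (for example because of a timeout or an unreachable host).
safe_read_html <- function(url) {
  tryCatch(
    read_html(url),
    error = function(e) {
      message('Could not read ', url, ': ', conditionMessage(e))
      NULL
    }
  )
}

# Usage: content <- safe_read_html('https://news.ycombinator.com/')
```

If the call returns NULL, the message printed by the handler is usually the quickest hint as to whether the proxy settings are the problem.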
Below is the screenshot of HN front page layout (with key elements highlighted):
Now, with the HTML content of the Hacker News front page loaded into the R object content, let us extract the data that we need — starting with the Title.
There is one particularly important aspect of making any web scraping assignment successful: identifying the right CSS selector, or XPath value, of the HTML elements whose values are supposed to be scraped. The easiest way to get the right element value is to use the inspect tool in the Developer Tools of any browser.
Here’s the screenshot of the CSS selector value, as highlighted by the Chrome Inspect tool when hovering over the title of a link on the Hacker News front page.
title <- content %>% html_nodes('a.storylink') %>% html_text()
title
"Magic Leap One"
"Show HN: Terminal – native micro-GUIs for shell scripts and command line apps"
"Tokio internals: Understanding Rust's async I/O framework"
"Funding Yourself as a Free Software Developer"
"US Federal Ban on Making Lethal Viruses Is Lifted"
"Pass-Thru Income Deduction"
"Orson Welles' first attempt at movie-making"
"D’s Newfangled Name Mangling"
"Apple Plans Combined iPhone, iPad, and Mac Apps to Create One User Experience"
"LiteDB – A .NET NoSQL Document Store in a Single Data File"
"Taking a break from Adblock Plus development"
"SpaceX’s Falcon Heavy rocket sets up at Cape Canaveral ahead of launch"
"This is not a new year’s resolution"
"Artists and writers whose works enter the public domain in 2018"
"Open Beta of Texpad 1.8, macOS LaTeX editor with integrated real-time typesetting"
"The triumph and near-tragedy of the first Moon landing"
"Retrotechnology – PC desktop screenshots from 1983-2005"
"Google Maps' Moat"
"Regex Parser in C Using Continuation Passing"
"AT&T giving $1000 bonus to all its employees because of tax reform"
"How a PR Agency Stole Our Kickstarter Money"
"Google Hangouts now on Firefox without plugins via WebRTC"
"Ubuntu 17.10 corrupting BIOS of many Lenovo laptop models"
"I Know What You Download on BitTorrent"
"Carrie Fisher’s Private Philosophy Coach"
"Show HN: Library of API collections for Postman"
"Uber is officially a cab firm, says European court"
"The end of the Iceweasel Age (2016)"
"Google will turn on native ad-blocking in Chrome on February 15"
"Bitcoin Cash deals frozen as insider trading is probed"
The rvest package supports the pipe operator %>%. Thus, the R object containing the content of the HTML page (read with read_html()) can be piped into html_nodes(), which takes a CSS selector or XPath as its argument and returns the matching HTML nodes. The text of those nodes can then be extracted with the html_text() function.
The beauty of rvest is that it abstracts the entire XML parsing operation under the hood of functions like html_nodes() and html_text(), making it easier for us to achieve our scraping goal with minimal code.
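To make the pipe concrete, here is a minimal sketch that runs on a tiny inline HTML snippet rather than the live page. The markup and variable names are made up for illustration. It shows that the piped form is just a more readable way of writing nested function calls, and that an XPath expression can select the same nodes as a CSS selector.

```r
library(rvest)

# Illustrative markup only; not copied from Hacker News.
page <- read_html('<html><body>
  <a class="storylink" href="https://example.com">Example title</a>
  <a class="other" href="#">Not a story</a>
</body></html>')

# Piped form: read left to right.
piped <- page %>% html_nodes('a.storylink') %>% html_text()

# Equivalent nested form: read inside out.
nested <- html_text(html_nodes(page, 'a.storylink'))

# The same nodes selected with XPath instead of a CSS selector.
via_xpath <- page %>% html_nodes(xpath = '//a[@class="storylink"]') %>% html_text()
```

All three objects hold the same character vector, so the choice between them is purely a matter of readability.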
As with the title, the CSS selector values of the other required elements of the web page can be identified with the Chrome Inspect tool. They can then be passed as arguments to html_nodes(), and the respective values can be extracted and stored in R objects.
link_domain <- content %>% html_nodes('span.sitestr') %>% html_text()
score <- content %>% html_nodes('span.score') %>% html_text()
age <- content %>% html_nodes('span.age') %>% html_text()
All the essential pieces of information have now been extracted from the page. An R data frame can be built from the extracted elements to put the data into a structured format.
df <- data.frame(title = title, link_domain = link_domain, score = score, age = age)
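One caveat: data.frame() needs all four vectors to have the same length. If some front-page items lack a domain or a score (which seems to be the case for some self-posts and job listings), the call above may fail. The sketch below shows one possible workaround on toy HTML: select one node per story row, then use html_node() (singular), which returns a missing value where nothing matches. The markup is a simplified assumption about the page layout, and the class names follow the selectors used above, which may no longer match the live site.

```r
library(rvest)

# Toy markup loosely modeled on the front page; the second story has no
# domain and the third has no score. All values are illustrative.
page <- read_html('<table>
  <tr class="athing"><td class="title"><a class="storylink" href="https://example.com/a">First story</a>
    <span class="sitestr">example.com</span></td></tr>
  <tr><td class="subtext"><span class="score">10 points</span> <span class="age">1 hour ago</span></td></tr>
  <tr class="athing"><td class="title"><a class="storylink" href="item?id=1">Ask HN: Second story</a></td></tr>
  <tr><td class="subtext"><span class="score">5 points</span> <span class="age">2 hours ago</span></td></tr>
  <tr class="athing"><td class="title"><a class="storylink" href="https://example.org/job">Third story</a>
    <span class="sitestr">example.org</span></td></tr>
  <tr><td class="subtext"><span class="age">3 hours ago</span></td></tr>
</table>')

# One node per story: the title row and its metadata cell.
story_rows <- page %>% html_nodes('tr.athing')
meta_cells <- page %>% html_nodes('td.subtext')

# html_node() returns exactly one result per input node (NA when absent),
# so every column has the same length.
df_safe <- data.frame(
  title       = story_rows %>% html_node('a.storylink') %>% html_text(),
  link_domain = story_rows %>% html_node('span.sitestr') %>% html_text(),
  score       = meta_cells %>% html_node('span.score') %>% html_text(),
  age         = meta_cells %>% html_node('span.age') %>% html_text(),
  stringsAsFactors = FALSE  # keep text columns as character in older R versions
)
```

This costs a few more lines than the seven-line version, so it is best seen as an optional hardening step rather than a replacement.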
Below is the screenshot of the final dataframe in RStudio viewer:
Thus, in just 7 lines of code, we have successfully built a Hacker News Frontpage Scraper in R.
R is a wonderful language for data analysis and data visualization. The code used here is available on my GitHub.