Back to blog
Web Scraping With R: Step-by-Step Tutorial
Augustas Pelakauskas
Back to blog
Augustas Pelakauskas
In the dynamic digital business landscape, web scraping is a vital method used to extract data and gain insights. Whether using an in-house web scraper or a solution like SERP scraper, public data acquisition undoubtedly enhances various business prospects. Learning a new programming language for scraping, especially with R, can be challenging for beginners compared to Python. However, you shouldn’t overlook R since it’s geared towards statisticians and offers powerful libraries for efficient web scraping.
This tutorial simplifies web scraping with R and covers the basics, from scraping static pages to dynamic websites that use JavaScript for content rendering. For your convenience, we also prepared this tutorial in a video format:
The installation of the required components can be broken down into two sections — Installing R and RStudio and Installing the libraries.
The first stage is to prepare the development environment for R. Two components will be needed – R and RStudio.
To download and install R, visit this page. Installing the base distribution is enough.
Alternatively, you can use package managers such as Homebrew for Mac or Chocolatey for Windows.
For macOS, run the following:
brew install r
Link to GitHubFor Windows, run the following:
choco install r.project
Link to GitHubNext, download and install RStudio by visiting this page. The free version, RStudio Desktop, is enough.
If you prefer package managers, the following are the commands for macOS using Homebrew and for Windows using Chocolatey:
For macOS, run the following:
brew install --cask rstudio
Link to GitHubFor Windows, run the following:
choco install r.studio
Link to GitHubOnce installed, launch RStudio.
Launching RStudio
There are two ways to install the required libraries. The first is using the user interface of RStudio. Locate the Packages tab in the Help section. Select the Packages tab to activate the Packages section. In this section, click the Install button.
The Install Package dialog is now open. Enter the package names in the text box for Packages. Lastly, click Install.
For the first section of the tutorial, the package that we’ll use is rvest. We also need the dplyr package to allow the use of the pipe operator. Doing so makes the code easier to read.
Enter these two package names, separated with a comma, and click Install.
Installing libraries
The second way is to install these packages using a console. To proceed, run the following commands in the console:
install.packages("rvest")
install.packages("dplyr")
Link to GitHubThe libraries are now installed. The next step is to scrape data.
The most popular library for web scraping from any public web page in R is the rvest. It provides functions to access a public web page and query-specific elements using CSS selectors and XPath. The library is a part of the Tidyverse collection of packages for data science, meaning that the coding conventions are the same across all of Tidyverse's libraries.
Let's initiate a web scraping operation using rvest. The first step is to send an HTTP GET request to a target web page. We'll be working with many rvest examples.
This section is written as an rvest cheat sheet. You can jump to any section that you need help with.
Begin with loading the rvest library by entering the following in the Source area:
library(rvest)
Link to GitHubAll of the commands entered in the source areas can be executed by simply placing the cursor in the desired line, selecting it, and then clicking the Run button on the top right of the Source area.
Alternatively, depending on your operating system, you can press Ctrl + Enter or Command + Enter.
In this example, we'll scrape publicly available data from a web page that lists ISO Country Codes. The hyperlink can be stored in a variable:
link = "https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes"
Link to GitHubTo send an HTTP GET request to this page, a simple function read_html() can be used.
This function needs one mandatory argument: a path or a URL. Note that this function can also read an HTML code string:
page = read_html(link)
Link to GitHubThe function above sends the HTTP GET request to the URL, retrieves the web page, and returns an object of html_document type.
The html_document object contains the desired public data in the raw HTML document. Many rvest functions are available to query and extract specific HTML file elements.
Note that if you need to use a rvest proxy, run the following to set the proxy in your script:
Sys.setenv(http_proxy="http://proxyserver:port")
The read_html doesn't provide any way to control the time out. To handle rvest read_html timeouts, you can use the httr library. The GET function from this library and tryCatch can help you handle the time-out errors.
Alternatively, you can use the session object from rvest as follows:
library(httr)
url <- "https://quotes.toscrape.com/api/quotes?page=1"
page<-read_html(GET(url, timeout(10))) # Method 1
page <- session(url,timeout(10)) #Method 2
The rvest package provides a convenient way to select HTML elements using CSS and XPath selectors.
Select the elements using html_elements() function. The syntax of this function is as follows:
page %>% html_elements(css="")
page %>% html_elements(xpath="")
Link to GitHubAn important aspect to note is the plural variation, which will return a list of matching elements. There's a singular variation of this function that returns only the first matching HTML element:
page %>% html_element()
Link to GitHubIf the selector type isn't specified, it's assumed to be a CSS Selector. For example, this Wiki web page contains the desired public data in a table.
An HTML markup of the table
The HTML markup of this table is as follows:
<table class="wikitable sortable jquery-tablesorter">
The only class needed to create a unique selector is the sortable class. It means that the CSS selector can be as simple as table.sortable. Using this selector, the function call will be as follows:
htmlElement <- page %>% html_element("table.sortable")
It stores the resulting html_element in a variable htmlElement.
The next step of our web scraping in R project is to convert the public data contained in html_element into a data frame.
In the previous section, we discussed selecting an element using the html_element function.
This function makes it easy to use the rvest select class. For example, if you want to select an element that has the class heading, all you need to write is the following line of code:
heading <- page %>% html_element(".heading")
Another use case is the rvest div class. If you want to use rvest to select a div, you can use something like:
page %>% html_element("div")
If you also use rvest to select div with a class:
page %>% html_element("div.heading")
You may come across to select HTML file nodes is the html_node() function. Note that this way of selecting HTML nodes in rvest is now obsolete. Instead, you should be using html_element() and html_elements().
From this element, you can extract text by calling the function html_text() as follows:
heading %>% html_text()
Link to GitHubAlternatively, if you're looking for an attribute, you can use the rvest html_attr function. For example, the following code will extract the src attribute of an element:
element %>% html_attr("src")
Link to GitHubYou can use the rvest read table function if you're working with HTML tables. This function takes an HTML that contains <table> elements and returns a data frame.
html_table(htmlElement)
Link to GitHubYou can use this to build rvest extract table code:
page %>% html_table()
As you can see, we can send the whole page and rvest reads tables, all of them.
If the page you are scraping uses JavaScript, there are two ways to scrape it. The first method is to use RSelenium. This approach is covered at length in the next section of this article.
In this section, let's talk about the second approach. This approach involves finding the hidden API that contains the data.
This is an excellent example to learn how rvest JavaScript works. This site uses infinite scroll.
Open this site in Chrome, press F12, and go to the network tab. Once we have network information, we can implement rvest infinite scrolling easily.
Scroll down to load more content and watch the network traffic. You'll notice that every time a new set of quotes are loaded, a call to the URL is sent, where the page number keeps on increasing.
Another thing to note is that the response is returned in JSON. There's an easy way to build a rvest JSON parser.
First, read the page.Then look for the <p> tag. This will contain the JSON data in text format.
page <- read_html("https://quotes.toscrape.com/api/quotes?page=1")
json_as_text <- page %>% html_element("p") %>% html_text()
Link to GitHubTo parse this JSON text into an R object, we need to use another library – jsonlite:
library(jsonlite)
Now, use the fromJSON method to convert this rvest JSON text into a native R object.
r_object <- json_as_text %>% fromJSON()
You can use a loop to parse rvest javascript for a page with infinite scroll. In the following example, we're running this loop ten times:
for (x in 1:10) {
url <- paste("https://quotes.toscrape.com/api/quotes?page=",x, sep = '')
page <- read_html(url)
# parse page to get JSON
}
You can modify this code as per your specific requirements.
Data frames are fundamental data storage structures in R. They resemble matrices but feature some critical differences. Data frames are tightly coupled collections of variables, where each column can be of a different data type. It's a powerful and efficient way of storing a large amount of data.
Most data and statistical analysis methods require data stored in data frames.
To convert the data stored in html_element, the function html_table can be used:
df <- html_table(htmlEl, header = FALSE)
Link to GitHubThe variable df is a data frame.
Note the use of an optional parameter header = FALSE. This parameter is only required in certain scenarios. In most cases, the default value of TRUE should work.
For the Wiki table, the header spawns two rows. Out of these two rows, the first row can be discarded, making it a three-step process.
The first step is to disable the automatic assignment of headers, which we have already done.
The next step is to set the column names with the second row:
names(df) <- df[2,]
Link to GitHub3. The third step is to delete the first two rows from the body.
df = df[-1:-2,]
Link to GitHubIt's now ready for further analysis.
Finally, the last step to scrape data from the HTML document is to save the data frame to a CSV file.
To export it, use the write.csv function. This function takes two parameters – the data frame instance and the name of the CSV file:
write.csv(df, "iso_codes.csv")
Link to GitHubThe function will export the data frame to a file iso_codes.csv in the current directory.
Images are easy to download with rvest. This involves a three-step process:
Downloading the page;
Locating the element that contains the URL of the desired image and extracting the URL of the image;
Downloading the image.
Let's begin by importing the packages.
library(rvest)
library(dplyr)
Link to GitHubWe'll download the first image from the Wikipedia page in this example. Download the page using the read_htmlI() function and locate the <img> tag that contains the desired image.
url = "https://en.wikipedia.org/wiki/Eiffel_Tower"
page <- read_html(url)
Link to GitHubTo locate the image, use the CSS selector ".infobox-image img".
image_element <- page %>% html_element(".infobox-image img")
Link to GitHubThe next step is to get the actual URL of the image, which is embedded in the src attribute. The rvest function html_attr() comes handy here.
image_url <- image_element %>% html_attr("src")
Link to GitHubThis URL is a relative URL. Let's convert this to an absolute URL. This can be done easily using one of the rvest functions — url_absolute() as follows:
image_url <- url_absolute(image_url, url)
Link to GitHubFinally, use another rvest function — download() to download the file as follows:
download.file(image_url, destfile = basename("paris.jpg"))
Link to GitHubThe most popular languages for public data analysis are Python and R. To analyze data, first, we need to collect publicly available data. The most common technique for collecting public data is web scraping. Thus, Python and R are suitable languages for web scraping, especially when the data needs to undergo analysis.
In this section, we'll quickly look at rvest vs beautifulsoup.
The BeautifulSoup library in Python is one of the most popular web scraping libraries because it provides an easy-to-use wrapper over the more complex libraries such as lxml. Rvest is inspired by BeautifulSoup. It's also a wrapper over more complex R libraries such as xml2 and httr.
Both Rvest and BeautifulSoup can query the document DOM using CSS selectors.
Rvest provides additional functionality to use Xpath, which BeautifulSoup lacks. BeautifulSoup instead uses its functions to compensate for the lack of XPath. Note that XPath allows traversing up to the parent node, while CSS cannot do that.
BeautifulSoup is only a parser. It's helpful for searching elements on the page but can't download web pages. You would need to use another library such as Requests for that.
Rvest, on the other hand, can fetch the web pages.
Eventually, the decision of rvest vs BeautifulSoup would depend on your familiarity with the programming language. If you know Python, use BeautifulSoup. If you know R, use Rvest.
While the rvest library works for most static websites, some dynamic websites use JavaScript to render the content. For such websites, a browser-based rendering solution comes into play.
Selenium is a popular browser-based rendering solution that can be used with R. Among the many great features of Selenium are taking screenshots, scrolling down pages, clicking on specific links or parts of the page, and inputting any keyboard stroke onto any part of a web page. It's the most versatile when combined with classic web scraping techniques.
The library that allows dynamic page scraping is RSelenium. It can be installed using the RStudio user interface as explained in the first section of this article, or by using the following command:
install.packages("RSelenium")
Link to GitHubOnce the package is installed, load the library using the following command:
library(RSelenium)
Link to GitHubThe next step is to start the Selenium server and browser.
There are two ways of starting a Selenium server and getting a client driver instance.
The first is to use RSelenium only, while the second way is to start the Selenium server using Docker and then connect to it using RSelenium. Let's delve deeper into how the first method works.
For the first method, download and install ChromeDriver.
RSelenium allows to setup the Selenium server and browser using the following function calls:
rD <- rsDriver(browser="chrome", port=9515L, verbose=FALSE)
remDr <- rD[["client"]]
Link to GitHubThis will download the required binaries, start the server, and return an instance of the Selenium driver.
Alternatively, you can use Docker to run the Selenium server and connect to this instance.
Install Docker and run the following command from the terminal.
docker run -d -p 4445:4444 selenium/standalone-firefox
Link to GitHubThis will download the latest Firefox image and start a container. Apart from Firefox, Chrome and PhantomJS can also be used.
Once the server has started, enter the following in RStudio to connect to the server and get an instance of the driver:
remDr <- remoteDriver(
remoteServerAddr = "localhost",
port = 4445L,
browserName = "firefox"
)
remDr$open()
Link to GitHubThese commands will connect to Firefox running in the Docker container and return an instance of the remote driver. If something isn't working, examine both the Docker logs and RSelenium error messages.
Note that after visiting a website and before moving on to the parsing functions, it might be essential to let a considerable amount of time pass. There's a possibility that data won’t be loaded yet, and the entire parsing algorithm will crash. The specific functions could be employed that wait for the particular HTML elements to load fully.
The first step is navigating the browser to the desired page. As an example, we'll scrape the name, prices, and stock availability for all books in the science fiction genre. The target is a dummy book store for practicing web scraping.
To navigate to this URL, use the navigate function:
remDr$navigate("https://books.toscrape.com/catalogue/category/books/science-fiction_16")
Link to GitHubTo locate the HTML elements, use findElements() function. This function is flexible and can work with CSS Selectors, XPath, or even with specific attributes, such as an id, name, name tag, etc. For a detailed list, see the official documentation.
In this example, we'll work with XPath.
The book titles are hidden in the alt attribute of the image thumbnail.
Locating book titles
The XPath for these image tags will be //article//img. The following line of code will extract all of these elements:
titleElements <- remDr$findElements(using = "xpath", "//article//img")
Link to GitHubTo extract the value of the alt attribute, we can use the getElementAttribute() function. However, in this particular case we have a list of elements.
To extract the attribute from all elements of the list, a custom function can be applied using the sapply function of R:
titles <- sapply(titleElements, function(x){x$getElementAttribute("alt")[[1]]})
Link to GitHubNote that this function will return the attribute value as a list. That's why we're using [[1]] to extract only the first value.
Moving on to extracting price data, the following is an HTML markup of the HTML element containing price:
<p class="price_color">£37.59</p>
Link to GitHubThe XPath to select this will be //*[@class='price_color']. Also, this time we'll use the getElementText() function to get the text from the HTML element. This can be done as follows:
pricesElements <- remDr$findElements(using = "xpath", "//*[@class='price_color']")
prices <- sapply(pricesElements, function(x){x$getElementText()[[1]]})
Link to GitHubLastly, the lines that extract stock availability will be as follows:
stockElements <- remDr$findElements(using = "xpath", "//*[@class='instock availability']")
stocks <- sapply(stockElements, function(x){x$getElementText()[[1]]})
Link to GitHubAt this point, there are three variables. Every variable is a list that contains a required data point.
Data points can be used to create a data frame:
df <- data.frame(titles, prices, stocks)
Link to GitHubOnce the data frame is created, it can be used for further analysis.
Moreover, it can be easily exported to CSV with just one line:
write.csv(df, "books.csv")
Link to GitHubYou can click here to find the complete code used in this article for your convenience.
Web scraping with R is a relatively uncomplicated and straightforward process if you are already familiar with the intricacies of R or programming in general. For most static web pages, the rvest library provides enough functionality, and you shouldn’t run into major setbacks. However, if any kind of dynamic elements come into play, a typical HTML extraction won’t be up to the task. If so, more often than not, RSelenium is the right solution to alleviate a more complicated load. When it gets too challenging, a dedicated and advanced web scraping API can save the day. For general web scraping, be sure to check out one of our Scraper API solutions.
If you want to find out more on how to scrape the web using other programming languages, check our articles, such as Web Scraping with JavaScript, Web Scraping with Java, Web Scraping with C#, Python Web Scraping Tutorial, What is Jupyter Notebook: Introduction, and many more.
R is a popular choice for public data web scraping, and deservedly so. It’s open-source, has powerful libraries, and is relatively easy to use. Since R has built-in data analysis functionalities, it’s commonly used for statistical analysis. Thus, R is indeed good for web scraping; however, it all depends on your use case and requirements. For instance, if your project requires dealing with complex data visualization and analysis, then you might be better off using R instead of a programming language like Python.
Both languages are deemed easy to learn and use, and both have robust libraries for web scraping. Python is a general-purpose programming language, while R is geared more towards statistical analysis; hence, if you already have experience with Python, your transition to web scraping may be easier if you stick with Python.
Additionally, both languages have different syntax, so you may find one more intuitive than the other. Another thing to note is that Python has a larger active community. Thus, it may be easier to find resources and help online compared to R.
All in all, it solely depends on your end goal and familiarity with other programming languages. For example, if you know Java or C++, then Python may be easier for you to learn than R due to their syntax similarity.
The most popular package for web scraping in R is the rvest package; however, it doesn’t support data scraping from websites that use JavaScript to load content dynamically. Hence, when it comes to dynamic content, you may want to use RSelenium. Other honorable mentions include httr and scrapyR packages.
About the author
Augustas Pelakauskas
Senior Copywriter
Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
Yelyzaveta Nechytailo
2024-12-09
Augustas Pelakauskas
2024-12-09
Get the latest news from data gathering world
Scale up your business with Oxylabs®