Augustas Pelakauskas

Dec 15, 2021 12 min read

In a rapidly evolving digital business landscape, it is hard to ignore the power of web scraping. Web scraping extracts data for analysis, turning it into insights that can improve various aspects of a business.

Getting started with web scraping can be daunting if it also means learning a new programming language. Thankfully, more and more programming languages provide powerful libraries that make scraping data from web pages more convenient.

One of the most popular programming languages for data and statistical analysis is R, an open-source language with many open-source libraries that make web scraping accessible to a wider audience.

Nonetheless, R can be challenging for beginners, especially when compared to more widespread languages such as Python. Keep in mind that R is aimed at statisticians and data analysts, while Python is more of a general-purpose programming language. If you already know R, scraping data from web pages is relatively straightforward: web pages can be converted to data frames or CSV files for further analysis.

This tutorial covers the basics of web scraping with R. We’ll begin by scraping static pages and then shift focus to techniques for scraping dynamic websites that use JavaScript to render their content.

Installing requirements

The installation of the required components can be broken down into two sections — Installing R and RStudio and Installing the libraries.

Installing R and RStudio

The first stage is to prepare the development environment for R. Two components will be needed – R and RStudio.

  • To download and install R, visit this page. Installing the base distribution is enough.

Alternatively, you can use package managers such as Homebrew for Mac or Chocolatey for Windows.

For macOS, run the following:

brew install r

For Windows, run the following:

choco install r.project
  • Next, download and install RStudio by visiting this page. The free version, RStudio Desktop, is enough.

If you prefer package managers, the following are the commands for macOS using Homebrew and for Windows using Chocolatey:

For macOS, run the following:

brew install --cask rstudio

For Windows, run the following:

choco install r.studio

Once installed, launch RStudio.

Installing required libraries

There are two ways to install the required libraries. The first is using the RStudio user interface. Locate the Packages tab in the pane that also contains the Help tab. Click the Packages tab to activate the Packages section, then click the Install button.

The Install Package dialog is now open. Enter the package names in the text box for Packages. Lastly, click Install.

For the first section of the tutorial, the package that we’ll use is rvest. We also need the dplyr package to allow the use of the pipe operator. Doing so makes the code easier to read.

Enter these two package names, separated with a comma, and click Install.

The second way is to install these packages using a console. To proceed, run the following commands in the console:

install.packages("rvest")
install.packages("dplyr")

The libraries are now installed. The next step is to start scraping data.

Web scraping with rvest

The most popular R library for scraping data from public web pages is rvest. It provides functions to access a public web page and query specific elements using CSS selectors and XPath. The library is part of the Tidyverse collection of packages for data science, meaning that the coding conventions are the same across all of the Tidyverse’s libraries.

Let’s initiate a web scraping operation using rvest. The first step is to send an HTTP GET request to a target web page.

Sending the GET request

Begin with loading the rvest library by entering the following in the Source area:

library(rvest)

All of the commands entered in the Source area can be executed by placing the cursor on the desired line, or selecting it, and then clicking the Run button at the top right of the Source area.

Alternatively, you can press Ctrl + Enter or Command + Enter, depending on your operating system.

In this example, we’ll scrape data from a public web page that lists ISO Country Codes. The hyperlink can be stored in a variable:

link = "https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes"

To send an HTTP GET request to this page, the read_html() function can be used.

This function needs one mandatory argument, which can be a path or a URL (it can also parse a raw HTML string, as shown a little later):

page = read_html(link)

The function above sends the HTTP GET request to the URL, retrieves the web page, and returns an object of html_document type.
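As a side note, the following is a minimal sketch of the HTML-string use case mentioned above; the string and variable names are made up for illustration:

html_string <- "<html><body><p>Hello, rvest!</p></body></html>"
snippet <- read_html(html_string)  # parses the raw HTML string instead of fetching a URL

The returned object is the same html_document type, so the querying functions described next work on it as well.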

The html_document object contains the desired data from the HTML document. There are many functions to query this HTML document to extract specific HTML elements.

Parsing HTML content

The rvest package provides a convenient way to select the HTML elements using CSS Selectors, as well as XPath.

Select elements with the html_elements() function. Its syntax is as follows:

page %>% html_elements(css="")
page %>% html_elements(xpath="")

Note the plural form: html_elements() returns a list of all matching elements, while the singular html_element() returns only the first matching HTML element:

page %>% html_element()

If the selector type is not specified, it is assumed to be a CSS Selector.
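To see the two selector types side by side, the sketch below queries the same page object both ways; the XPath expression is an illustrative assumption rather than part of the original example:

tables_css <- page %>% html_elements(css = "table.sortable")                              # CSS selector
tables_xpath <- page %>% html_elements(xpath = "//table[contains(@class, 'sortable')]")   # equivalent XPath

On this page, both calls should return the same matching <table> elements.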

For example, this Wiki web page contains the desired data in a table.

The HTML markup of this table is as follows:

<table class="wikitable sortable jquery-tablesorter">

The only class needed to create a unique selector is the sortable class. It means that the CSS selector can be as simple as table.sortable. Using this selector, the function call will be as follows:

htmlElement <- page %>% html_element("table.sortable")

It stores the resulting html_element in a variable htmlElement.

The next step of our web scraping project is to convert the data contained in html_element into a data frame.

Saving data to a data frame

Data frames are fundamental data storage structures in R. They resemble matrices but feature some important differences. Data frames are tightly coupled collections of variables, where each column can be of a different data type. It is a powerful and efficient way of storing a large amount of data.

Most data and statistical analysis methods require data stored in data frames.
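As a quick illustration of the structure, here is a small hand-built data frame (the values are made up for the example):

books <- data.frame(
  title = c("Dune", "Foundation"),  # character column
  price = c(11.50, 9.99),           # numeric column
  in_stock = c(TRUE, FALSE)         # logical column
)
str(books)  # shows two observations of three variables, each column with its own type

Each column is a variable and each row is an observation, which is exactly the shape the scraped table will be converted into.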

To convert the data stored in html_element, the function html_table can be used:

df <- html_table(htmlElement, header = FALSE)

The variable df is a data frame.

Note the use of an optional parameter header = FALSE. This parameter is only required in certain scenarios. In most cases, the default value of TRUE should work.

For the Wiki table, the header spans two rows. Out of these two rows, the first row can be discarded, making it a three-step process.

  1. The first step is to disable the automatic assignment of headers, which we have already done.
  2. The next step is to set the column names with the second row:

names(df) <- df[2,]

  3. The third step is to delete the first two rows from the body of the data frame:

df = df[-1:-2,]

The data frame is now ready for further analysis.
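For reference, the static-scraping steps covered so far can be collected into a single short script; it simply restates the commands from the sections above:

library(rvest)
library(dplyr)

link <- "https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes"
page <- read_html(link)                  # fetch and parse the page

df <- page %>%
  html_element("table.sortable") %>%     # select the country-code table
  html_table(header = FALSE)             # convert it to a data frame

names(df) <- df[2,]                      # use the second header row for column names
df <- df[-1:-2,]                         # drop the two header rows from the body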

Exporting data frame to a CSV file

Finally, the last step of extracting data from the HTML document is to save the data frame to a CSV file.

To export the data frame, use the write.csv function. This function takes two parameters – the data frame instance and the name of the CSV file:

write.csv(df, "iso_codes.csv")

The function will export the data frame to a file iso_codes.csv in the current directory.
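By default, write.csv also adds a column of row numbers. If you prefer the file without it, pass row.names = FALSE:

write.csv(df, "iso_codes.csv", row.names = FALSE)  # omit the row-number column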

Web scraping with RSelenium

While the rvest library works for most static websites, some dynamic websites use JavaScript to render the content. For such websites, a browser-based rendering solution comes into play.

Selenium is a popular browser-based rendering solution that can be used with R. Among the many great features of Selenium are taking screenshots, scrolling down pages, clicking on specific links or parts of the page, and inputting any keyboard stroke onto any part of a web page. It’s the most versatile when combined with classic web scraping techniques.

The library that allows dynamic page scraping is RSelenium. It can be installed using the RStudio user interface as explained in the first section of this article, or by using the following command:

install.packages("RSelenium")

Once the package is installed, load the library using the following command:

library(RSelenium)

The next step is to start the Selenium server and browser.

Starting Selenium

There are two ways of starting a Selenium server and getting a client driver instance.

The first is to use RSelenium only, while the second way is to start the Selenium server using Docker and then connect to it using RSelenium. 

  • This is how the first method works.

RSelenium allows you to set up the Selenium server and browser using the following function calls:

rD <- rsDriver(browser="chrome", port=9515L, verbose=FALSE)
remDr <- rD[["client"]]

This will download the required binaries, start the server, and return an instance of the Selenium driver.

  • Alternatively, you can use Docker to run the Selenium server and connect to this instance.

Install Docker and run the following command from the terminal.

docker run -d -p 4445:4444 selenium/standalone-firefox

This will download the latest Firefox image and start a container. Apart from Firefox, Chrome and PhantomJS can also be used.
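For instance, a container with Chrome can be started from the selenium/standalone-chrome image in the same way (adjust the ports and image tag as needed):

docker run -d -p 4445:4444 selenium/standalone-chrome

If you go this route, remember to set browserName = "chrome" when connecting.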

Once the server has started, enter the following in RStudio to connect to the server and get an instance of the driver:

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
remDr$open()

These commands will connect to Firefox running in the Docker container and return an instance of the remote driver. If something is not working, examine both the Docker logs and RSelenium error messages.
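To confirm that the connection works, you can ask the remote browser to load a page and save a screenshot; the URL and file name below are only illustrative:

remDr$navigate("https://www.example.com")        # load any page in the remote browser
remDr$screenshot(file = "connection_check.png")  # save a screenshot of what the browser rendered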

Working with elements in Selenium

Note that after navigating to a website and before moving on to the parsing functions, it might be necessary to let some time pass. If the content hasn’t finished loading, the parsing step will fail. You can either pause for a fixed amount of time or wait for the particular HTML elements to load fully.
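One simple option is a fixed Sys.sleep() pause. A slightly more robust option is a small polling helper such as the sketch below; the function name, timeout, and interval are assumptions made for illustration:

# Poll for an element until it appears or the timeout is reached
waitForElement <- function(driver, xpath, timeout = 10, interval = 0.5) {
  elapsed <- 0
  while (elapsed < timeout) {
    found <- driver$findElements(using = "xpath", xpath)
    if (length(found) > 0) return(found[[1]])  # element is present, return the first match
    Sys.sleep(interval)                        # wait before checking again
    elapsed <- elapsed + interval
  }
  stop("Timed out waiting for element: ", xpath)
}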

The first step is navigating the browser to the desired page. As an example, we’ll scrape the names, prices, and stock availability of all books in the science fiction genre. The target is a dummy book store made for practicing web scraping.

To navigate to this URL, use the navigate function:

remDr$navigate("https://books.toscrape.com/catalogue/category/books/science-fiction_16")

To locate HTML elements, use the findElements() function. This function is flexible and can work with CSS selectors, XPath, or specific attributes, such as an id, name, tag name, etc. For a detailed list, see the official documentation.

In this example, we’ll work with XPath.

The book titles are hidden in the alt attribute of the image thumbnail.

The XPath for these image tags will be //article//img. The following line of code will extract all of these elements:

titleElements <- remDr$findElements(using = "xpath", "//article//img")

To extract the value of the alt attribute, we can use the getElementAttribute() function. However, in this particular case we have a list of elements.

To extract the attribute from all elements of the list, a custom function can be applied using the sapply function of R:

titles <- sapply(titleElements, function(x){x$getElementAttribute("alt")[[1]]})

Note that this function will return the attribute value as a list. That’s why we are using [[1]] to extract only the first value.

Moving on to extracting price data, the following is an HTML markup of the HTML element containing price:

<p class="price_color">£37.59</p>

The XPath to select this will be //*[@class='price_color']. Also, this time we’ll use the getElementText() function to get the text from the HTML element. This can be done as follows:

pricesElements <- remDr$findElements(using = "xpath", "//*[@class='price_color']")
prices <-  sapply(pricesElements, function(x){x$getElementText()[[1]]})

Lastly, the lines that extract stock availability will be as follows:

stockElements <- remDr$findElements(using = "xpath", "//*[@class='instock availability']")
stocks <-  sapply(stockElements, function(x){x$getElementText()[[1]]})

Creating a data frame

At this point, there are three variables: titles, prices, and stocks. Each is a list containing one of the required data points.

Data points can be used to create a data frame:

df <- data.frame(titles, prices, stocks)

Once the data frame is created, it can be used for further analysis.

Moreover, the data frame can be easily exported to CSV with just one line:

write.csv(df, "books.csv")
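When the run is finished, it’s good practice to close the browser session and, if the server was started with rsDriver(), stop it as well:

remDr$close()          # close the remote browser session
rD[["server"]]$stop()  # stop the Selenium server started by rsDriver() (first method only)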

Conclusion

Web scraping with R is a relatively straightforward process if you are already familiar with R or with programming in general. For most static web pages, the rvest library provides enough functionality, and you shouldn’t run into major setbacks. However, when dynamic elements come into play, plain HTML extraction won’t be up to the task, and more often than not, RSelenium is the right tool for the heavier lifting.

If you want to find out more on how to scrape the web using other programming languages, check our articles, such as Web Scraping with JavaScript, Web Scraping with Java, Web Scraping with C#, Python Web Scraping Tutorial, and many more.

About Augustas Pelakauskas

Augustas Pelakauskas is a Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his third best friend.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
