Scrapy Splash Tutorial: How to Scrape JavaScript-Rendered Websites

Roberta Aukstikalnyte

2023-04-25, 5 min read

Splash is a lightweight headless browser with an HTTP API, and `scrapy-splash` is the plugin that connects it to Scrapy. Together, commonly referred to as Scrapy Splash, they’re used to scrape websites that render data with JavaScript or AJAX calls. In today’s article, we’ll demonstrate how to use Scrapy Splash when scraping data from websites with JavaScript rendering.

How to configure Splash

First, let’s take a look at the steps for installing Scrapy Splash. 

Scrapy Splash talks to the Splash HTTP API, so you’ll also need a running Splash instance: either install Splash manually or use the Docker image. In this tutorial, we’re going to use Docker.

If you want to configure Splash manually without Docker, you can check the detailed instructions in the official installation documentation.

1) How to set up and install Docker

Docker is an open-source containerization technology that’ll help us run the Splash instance in an isolated container. To install Docker on Linux, you can use the commands below:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

If you’re using Windows, macOS, or any other operating system, you can find the relevant installation information on the Docker website.

2) Downloading and installing Splash

Once Docker is installed, you can use the `docker` command to pull the Splash Docker image from Docker Hub.

docker pull scrapinghub/splash

The above command will download the Splash Docker image. Now, you’ll have to run it using the below command:

docker run -it -p 8050:8050 --rm scrapinghub/splash

The Splash instance will be available and ready to use on port 8050. So, if you visit localhost:8050 in your browser, you’ll see the default Splash page.
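
To check the instance from Python rather than the browser, here is a minimal sketch that builds a request URL for Splash’s `render.html` endpoint using only the standard library. The helper names `build_render_url` and `fetch_rendered_html` are hypothetical, and the sketch assumes Splash is listening on localhost:8050 as set up above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Splash container from the previous step is running locally.
SPLASH_URL = "http://localhost:8050"


def build_render_url(target_url, wait=1, splash_url=SPLASH_URL):
    """Build a render.html URL that asks Splash to load target_url,
    wait `wait` seconds for scripts to run, and return the final HTML."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(target_url, wait=1):
    """Hypothetical helper: fetch the rendered HTML through Splash.
    Needs a running Splash instance, so it is only defined here, not called."""
    with urlopen(build_render_url(target_url, wait), timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_rendered_html("https://quotes.toscrape.com/")` with the container running should return the page’s HTML after rendering. If it raises a connection error, the Splash instance is most likely not reachable on that port.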

3) Installing Scrapy via pip

Next, you’ll have to install Scrapy and the `scrapy-splash` plugin. Both can be installed with a single command:

pip install scrapy scrapy-splash

The `pip` command will download all the necessary dependencies from the Python Package Index (PyPI) and install them.

4) Setting up a new Scrapy project using the command line

The next step is to create a Scrapy project. You’ll need to run the following command:

scrapy startproject splashscraper

It’s as simple as that! You’ve now created your first Scrapy project named splashscraper. Depending on your Scrapy version, this command will create a project structure similar to the one below:

splashscraper
├── scrapy.cfg
└── splashscraper
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 7 files

5) Configuring Scrapy to use Splash by modifying the settings file

Now, you’ll have to update the `settings.py` file to include some Splash-specific settings. First, you’ll need to create a `SPLASH_URL` variable that points to the Splash instance that you started in the second step.

SPLASH_URL = 'http://localhost:8050'

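Then, you’ll have to modify `DOWNLOADER_MIDDLEWARES` to include the Splash middlewares. The scrapy-splash documentation also sets the priority of Scrapy’s `HttpCompressionMiddleware` in the same dictionary:

DOWNLOADER_MIDDLEWARES = {
   'scrapy_splash.SplashCookiesMiddleware': 723,
   'scrapy_splash.SplashMiddleware': 725,
   'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

After that, you’ll have to update `SPIDER_MIDDLEWARES` with another Splash middleware, which is necessary for deduplication, and add a Splash-aware duplicate filter.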

SPIDER_MIDDLEWARES = {
   'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

The rest of the settings can be left with the default values for now.
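
Since a missing entry here tends to fail quietly, a small check can help. The following is only a sketch: `missing_splash_settings` is a hypothetical helper, not part of Scrapy or scrapy-splash, and it simply looks for the names introduced in this step in a plain dictionary of settings.

```python
from typing import Dict, List

# Names taken from the settings shown in this step.
EXPECTED_DOWNLOADER = {
    "scrapy_splash.SplashCookiesMiddleware",
    "scrapy_splash.SplashMiddleware",
}
EXPECTED_SPIDER = {"scrapy_splash.SplashDeduplicateArgsMiddleware"}
EXPECTED_DUPEFILTER = "scrapy_splash.SplashAwareDupeFilter"


def missing_splash_settings(settings: Dict) -> List[str]:
    """Hypothetical helper: list the Splash-related settings that are absent."""
    problems = []
    if not settings.get("SPLASH_URL"):
        problems.append("SPLASH_URL")
    downloader = set(settings.get("DOWNLOADER_MIDDLEWARES", {}))
    for name in sorted(EXPECTED_DOWNLOADER - downloader):
        problems.append(f"DOWNLOADER_MIDDLEWARES: {name}")
    spider = set(settings.get("SPIDER_MIDDLEWARES", {}))
    for name in sorted(EXPECTED_SPIDER - spider):
        problems.append(f"SPIDER_MIDDLEWARES: {name}")
    if settings.get("DUPEFILTER_CLASS") != EXPECTED_DUPEFILTER:
        problems.append("DUPEFILTER_CLASS")
    return problems
```

An empty list means every setting from this step is present. Otherwise, the returned names show which entries still need to be added to `settings.py`.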

Writing a Scrapy Splash spider

Scrapy has a built-in command to generate spiders. You’ll use this command to create the boilerplate of the spider. We’ll use the quotes.toscrape.com website as the target.

This website lists a lot of quotes, each with an author and a set of tags. You can extract all the information available and use the pagination to navigate to the other pages. First, let’s generate the spider using the command below:

scrapy genspider quotes quotes.toscrape.com

Once you execute this command, it’ll create a new file in the spiders directory. Let’s take a look at this file.

1) Understanding the basics of a Scrapy spider

In the spiders directory, you’ll find a file named quotes.py. Open it in any text editor. It’ll look like this:

import scrapy

class QuotesSpider(scrapy.Spider):
   name = 'quotes'
   allowed_domains = ['quotes.toscrape.com']
   start_urls = ['http://quotes.toscrape.com/']

   def parse(self, response):
       pass

Notice the class variables of the `QuotesSpider` class: `name` identifies the spider, and the `allowed_domains` list restricts the spider to the listed domains. This ensures that the spider won’t send network requests to websites or domains outside its scope.

When the spider starts, it’ll request all the URLs listed in `start_urls` asynchronously. The generator also created the `parse` method, which gets invoked with the response for each URL by default.
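
To make the effect of `allowed_domains` concrete, here is a simplified, standard-library sketch of the kind of check involved. It is an illustration only, not Scrapy’s actual implementation, and `is_allowed` is a hypothetical name. It assumes that a listed domain also covers its subdomains.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )
```

With `allowed_domains = ['quotes.toscrape.com']`, a link to http://quotes.toscrape.com/page/2/ passes the check, while a link to another site does not.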

2) Writing a spider to scrape data from a single page

Right now, the spider isn’t doing anything. You’ll have to update the parse method to bring it to life. 

Inspect elements using a web browser

First, let’s check the HTML source of the website in a web browser using the developer tools.

Each quote is wrapped in a `div` tag with the class name `quote`. The quote’s text is enclosed in a `span` tag with the class `text`. The author’s name is wrapped in a `small` tag with the class `author`. And lastly, the tags are available in the `content` attribute of a `meta` tag with the class `keywords`.

Prepare the class SplashscraperItem 

Now, you’ll have to create a class to map the data into Scrapy items. By default, Scrapy uses the items.py file to store the class definition. Fortunately, you don’t have to write it from scratch, as Scrapy already generated this file for you. Open the items.py file inside the `splashscraper` package directory. Then, modify it to include three fields: author, text, and tags.

import scrapy

class SplashscraperItem(scrapy.Item):
   author = scrapy.Field()
   text = scrapy.Field()
   tags = scrapy.Field()

Implement parse() method

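Next, you’ll have to implement the parse method. You’ll begin by importing the `SplashscraperItem` class that you created in the previous step. The import uses the project’s package name so that it resolves when the spider is run with the `scrapy` command from the project root:

from splashscraper.items import SplashscraperItem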

Then, you’ll have to implement the parse method as shown below: 

   def parse(self, response):
       for quote in response.css("div.quote"):
           text = quote.css("span.text::text").extract_first("")
           author = quote.css("small.author::text").extract_first("")
           tags = quote.css("meta.keywords::attr(content)").extract_first("")
           item = SplashscraperItem()
           item['text'] = text
           item['author'] = author
           item['tags'] = tags
           yield item

In this `parse()` method, the code uses the response object to select all the quote `div` elements. After that, it iterates over each one to extract the text, author, and tags. Once this data is extracted, it creates a `SplashscraperItem()` with the data and yields the `item` object.

3) Handling pagination

The code that you’ve written so far only works on a single page. So, let’s modify it further to navigate through all the pages using the pagination of the website. 

Head over to the website again using a web browser and scroll to the bottom, where you’ll see the “Next” button. As soon as you click it, it loads the second page.

Now, you’ll need code so the process of getting to the next page is automated.

By taking a closer look, we can see that the website is using an anchor tag nested inside a `li` element with the class `next`. You’ll have to add the following lines after the for loop that you’ve written earlier:

next_url = response.css("li.next>a::attr(href)").extract_first("")
if next_url:
    yield scrapy.Request(response.urljoin(next_url), self.parse)

The `next_url` variable will contain the relative URL of the next page (for example, /page/2/), so `response.urljoin()` is used to turn it into an absolute URL that `scrapy.Request` accepts. When the spider reaches the last page, `next_url` will be an empty string. So, the if statement checks this condition to properly handle that case.
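
The “Next” link on this site holds a relative path such as /page/2/, while a request needs an absolute URL. As a rough, standard-library illustration of that resolution step (Scrapy offers `response.urljoin()` for the same purpose), consider the sketch below. `next_page_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the href of the Next link against the current page URL.
    Returns None when there is no Next link (empty href), i.e. on the last page."""
    if not href:
        return None
    return urljoin(current_url, href)
```

For example, resolving /page/2/ against https://quotes.toscrape.com/ gives https://quotes.toscrape.com/page/2/, and an empty string gives `None`, which corresponds to the last page.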

4) Adding Splash requests to the spider to scrape dynamic content using SplashRequest

To use SplashRequest, you’ll have to make some changes to the current spider. 

First, you’ll have to import `SplashRequest` from the `scrapy_splash` library.

from scrapy_splash import SplashRequest

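If the `start_urls` list is used, Scrapy sends plain `scrapy.Request` objects by default. So, you’ll have to replace the list with a `start_requests` method that yields a `SplashRequest` instead. The `wait` argument tells Splash how many seconds to wait after the page loads before returning the result.

   def start_requests(self):
       url = 'https://quotes.toscrape.com/'
       yield SplashRequest(url, self.parse, args={'wait': 1})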

Last but not least, you’ll need to update the parse method to use `SplashRequest` for the next page as well. Once you make these changes, the code will look like this:

import scrapy
from scrapy_splash import SplashRequest
from splashscraper.items import SplashscraperItem

class QuotesSpider(scrapy.Spider):
   name = 'quotes'
   allowed_domains = ['quotes.toscrape.com']

   def start_requests(self):
       # Send the first request through Splash instead of a plain scrapy.Request
       url = 'https://quotes.toscrape.com/'
       yield SplashRequest(url, self.parse, args={'wait': 1})

   def parse(self, response):
       for quote in response.css("div.quote"):
           text = quote.css("span.text::text").extract_first("")
           author = quote.css("small.author::text").extract_first("")
           tags = quote.css("meta.keywords::attr(content)").extract_first("")
           item = SplashscraperItem()
           item['text'] = text
           item['author'] = author
           item['tags'] = tags
           yield item
       # The href is relative, so resolve it before requesting the next page
       next_url = response.css("li.next>a::attr(href)").extract_first("")
       if next_url:
           yield SplashRequest(response.urljoin(next_url), self.parse, args={'wait': 1})

Handling Splash responses

Splash responses have all the properties and attributes of a standard Scrapy Response. So, you don’t have to make any changes to the parse method.

1) Understanding how Splash responds to requests and its response object

Scrapy Splash returns various Response subclasses for Splash requests depending on the request type, for example:

  • SplashResponse - returned for binary Splash responses that contain media files, e.g., an image from the /render.png endpoint;

  • SplashTextResponse - returned when the result is text, e.g., HTML from the /render.html endpoint;

  • SplashJsonResponse - returned when the result is a JSON object, e.g., from the /render.json endpoint.
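
As an illustration of the JSON case, the sketch below unpacks a payload shaped like a `render.json` result requested with `html=1` and `png=1`, where the screenshot arrives base64-encoded. The function name `unpack_render_json` is hypothetical and any sample payload you pass in is your own. In a spider, the decoded JSON of a `SplashJsonResponse` should be available as `response.data`, so only the decoding step would apply.

```python
import base64
import json


def unpack_render_json(body):
    """Split a render.json-style payload into HTML text and raw PNG bytes.
    Keys that were not requested from Splash are treated as empty."""
    data = json.loads(body)
    html = data.get("html", "")
    png_bytes = base64.b64decode(data["png"]) if "png" in data else b""
    return html, png_bytes
```

The HTML string can then be parsed with a selector as usual, while the PNG bytes can be written to a file for debugging what Splash actually rendered.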

2) Parsing data from Splash responses

You can use Scrapy’s built-in parser and Selector classes to parse Splash responses. In the last example, you used the `css()` method with CSS selectors to extract the desired data:

text = quote.css("span.text::text").extract_first("")
author = quote.css("small.author::text").extract_first("")
tags = quote.css("meta.keywords::attr(content)").extract_first("")

Note the `::text` inside the CSS selectors: it tells Scrapy to extract the text of the element. Similarly, `::attr(content)` selects the content attribute of the meta tag.
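
The `tags` value extracted this way is a single string. Assuming the content attribute holds the tags as a comma-separated list (which is how the page appears to format it), a small hypothetical helper like `split_tags` can turn it into a Python list before it is stored in the item:

```python
def split_tags(content):
    """Split a comma-separated tag string into a list, dropping empty entries."""
    return [tag.strip() for tag in content.split(",") if tag.strip()]
```

In the parse method, this would be used as `item['tags'] = split_tags(tags)`. An empty string, which is the default returned by `extract_first("")`, results in an empty list.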

Conclusion

Scrapy Splash is an excellent tool for extracting content from dynamic websites in bulk. It enables Scrapy to leverage the Splash headless browser and scrape websites with JavaScript rendering with ease.

If you enjoyed this blog post, make sure to check out our post on dynamic web scraping with Python and Selenium. In case JavaScript is your preferred language, check out this general web scraping guide using JavaScript & Node.js.

Frequently asked questions

What is the difference between Scrapy Splash and Selenium?

Scrapy Splash is an interface to the lightweight headless browser Splash. Meanwhile, Selenium is a web testing and automation framework. Scrapy Splash doesn’t require a third-party browser since it uses the Splash browser instance, but Selenium relies on third-party browsers and their web drivers, such as ChromeDriver, GeckoDriver, and so on.

If you’re curious to learn more about the differences between Scrapy and Selenium, see our detailed Scrapy vs. Selenium comparison.

What is the difference between Scrapy Splash and Beautiful Soup?

Beautiful Soup is a parsing library. It can parse HTML pages using various parsers such as html.parser, lxml, etc. However, it can’t make any network requests, so it depends on network libraries such as requests, httpx, etc. Scrapy Splash covers the opposite side: it interacts with the Splash browser through the API to fetch rendered pages, but it doesn’t parse the content of the response itself. That part is left to Scrapy’s selectors.

What is the difference between Scrapy Splash and Playwright?

Just like Selenium, Playwright is also primarily a testing framework that emphasizes software testing and automation. It’s a great tool for website interaction and human browsing simulation. 

Playwright uses third-party browser engines, making it slightly more resource-heavy than Scrapy Splash. Because it drives full browser engines, it can also be more stealthy than Scrapy Splash and bypass complex anti-bot measures, making web scraping easier.

About the author

Roberta Aukstikalnyte

Senior Content Manager

Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
