
Serverless Web Scraping with Scrapy and AWS Lambda


Yelyzaveta Nechytailo

2023-08-02 · 5 min read

In today's data-driven world, effective web scraping is crucial. However, managing a fleet of servers to perform scraping operations is not always the most convenient or cost-effective option for engineers and developers. This is where the combination of serverless computing and web scraping frameworks comes into play. This article explores how to use Scrapy, a popular web scraping framework, together with AWS Lambda, a serverless computing platform from Amazon Web Services.

What is serverless web scraping?

Serverless web scraping harnesses the power of serverless computing, like AWS Lambda, and a web crawling framework, such as Scrapy, to efficiently extract data from the web. By combining these technologies, developers can create robust, scalable, and cost-effective web scraping solutions without needing to manage any servers or pay for idle time.

What is AWS Lambda?

AWS Lambda is a serverless computing service provided by Amazon Web Services. It allows developers to run code without managing servers, automatically scaling to handle any workload. In simpler terms, it takes care of all the infrastructure needed to run your code on a high-availability, distributed compute platform, allocating the right resources for each invocation.

The serverless aspect ensures optimal allocation of resources, allowing the application to scale up during high demand and scale down during quieter periods. This elasticity makes serverless web scraping a smart choice for projects with unpredictable load or where high-scale operations are needed sporadically.

On the other hand, using a powerful web crawling framework like Scrapy offers comprehensive tools to scrape websites effectively and with greater control. This framework enables handling complex data extraction and storing the scraped data in the desired format.

What is Scrapy?

Scrapy is an open-source and collaborative web crawling framework written in Python. It's designed to handle a wide range of tasks, from crawling websites and extracting data to processing the scraped results. With built-in functionality for extracting and storing data in your preferred structure and format, Scrapy stands out as a robust framework for web scraping.

To perform web scraping with Scrapy effectively, we recommend integrating it with Oxylabs’ Residential or Datacenter Proxies. The biggest advantage of integrating proxies with Scrapy is that they hide your actual IP address from the target site's server while scraping. Using proxies protects your privacy and reduces the chance of being blocked by target sites, which is a common risk when you use an automated tool instead of manually copying and pasting data from a page.
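As a minimal sketch of how such an integration might look, you can attach a proxy to each request through the proxy key in the request meta, which Scrapy's built-in HttpProxyMiddleware picks up automatically. The spider name, username, password, and proxy endpoint below are placeholders, so substitute the values from your own proxy provider's dashboard:

import scrapy


class ProxiedBooksSpider(scrapy.Spider):
    # Hypothetical spider used purely to illustrate per-request proxying
    name = "books_proxied"
    start_urls = ["https://books.toscrape.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's HttpProxyMiddleware reads the "proxy" meta key and
            # sets the Proxy-Authorization header from the credentials
            yield scrapy.Request(
                url,
                meta={"proxy": "http://USERNAME:PASSWORD@pr.oxylabs.io:7777"},
            )

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)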

How to run Scrapy in AWS Lambda

Step 1: Creating Scrapy spider

Setting up a Scrapy crawler is the first step towards serverless web scraping. Scrapy uses spiders, which are self-contained crawlers that are given a set of instructions.

Here's an example of a simple Scrapy spider:

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Each book on the page is wrapped in an <article> element
        for s in response.css("article"):
            yield {
                "title": s.css("h3 a::attr(title)").get(),
                "price": s.css("p.price_color::text").get(),
            }
        # Follow the pagination link until there are no more pages
        next_page = response.css("li.next > a::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

In this example, we've created a simple spider that will start at https://books.toscrape.com, extract the title and price of all the books, go to the next page, and repeat. The final result is scraped data from 1,000 books.

If you run the spider on your machine, you can use the -o switch to redirect the output to a file. For example, the following command creates a books.json file:

scrapy runspider books.py -o books.json

Step 2: Modifying Scrapy spider for AWS 

We cannot access the terminal or the file system when we run the Scrapy spider as an AWS Lambda function. This means we can't simply write the output to a local file and retrieve it later. The alternative is to store the output in an S3 bucket.

Modify the spider code to add these custom settings:

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    custom_settings = {
        "FEED_URI": "s3://YOUR-BUCKET-NAME/items.json",
        "FEED_FORMAT": "json",
    }
    # rest of the code remains the same

Replace YOUR-BUCKET-NAME with your actual S3 bucket name.
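Note that on recent Scrapy releases the FEED_URI and FEED_FORMAT pair is deprecated in favor of the consolidated FEEDS setting. If you're running a newer version, the equivalent configuration should look roughly like this (the bucket name is again a placeholder):

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    # Equivalent feed export configuration using the newer FEEDS setting
    custom_settings = {
        "FEEDS": {
            "s3://YOUR-BUCKET-NAME/items.json": {"format": "json"},
        },
    }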

Step 3: Preparing the environment for Lambda function 

Before we can configure the local environment for AWS Lambda, install the following tools:

  • Docker

  • AWS CLI

  • Serverless Framework

  • boto3 package

Configuring the Lambda function requires creating a Docker image of our Scrapy spider and uploading it to the AWS platform. Docker allows you to package an application with its environment and dependencies into a container, which can be quickly shipped and run anywhere. This process ensures that our application will run the same, regardless of any customized settings or previously installed software on the AWS Lambda server that could differ from our local development environment.

You can download and install Docker Personal from docker.com. Ensure that you can run the docker executable from the command line.

The AWS Command Line Interface (AWS CLI) is a powerful tool that lets you interact with Amazon Web Services (AWS) from your operating system's command line. It provides a convenient and efficient way to manage various AWS services and resources without a graphical user interface.

To install AWS CLI, visit https://aws.amazon.com/cli/ and download the package for your operating system.


You can install the Serverless Framework using npm. If you don't have Node.js installed yet, go to the official Node.js website: https://nodejs.org/, download the LTS release, and install it. After that, run the following command:

$ npm install -g serverless

Boto3 is the AWS SDK for Python, developed by Amazon Web Services (AWS). It's built on top of Botocore, a low-level library that provides the core functionality for interacting with AWS services from Python code. Scrapy relies on it to upload feed exports to S3.

Ensure that you have created and activated a virtual environment, then run the following:

(venv) $ pip install boto3

Note that this virtual environment should also have Scrapy installed. If you haven't installed it yet, do so using pip:

(venv) $ pip install scrapy
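Once your AWS credentials are configured (we'll run aws configure in Step 5), you can quickly confirm that boto3 can reach your bucket with a short script like the one below; the bucket name is a placeholder for the one you'll create.

import boto3

# Assumes credentials have been set up with `aws configure` or environment variables
s3 = boto3.client("s3")

# Placeholder: replace with the bucket that will hold the scraped data
bucket = "YOUR-BUCKET-NAME"

response = s3.list_objects_v2(Bucket=bucket)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])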

Step 4: Prepare the code for Lambda function deployment

To keep the Docker container manageable and limited to the necessary dependencies, it's crucial to create a requirements.txt file that lists all the Python packages your Scrapy spider needs to run. This file might look like this:

requests==2.31.0
requests-file==1.5.1
Scrapy==2.6.0
boto3==1.28.14
service-identity==21.1.0
cryptography==38.0.4
# more packages

The next step is to create a Docker image. Create a file, name it Dockerfile, and add the following contents:

FROM public.ecr.aws/lambda/python:3.10

# Required for lxml
RUN yum install -y gcc libxml2-devel libxslt-devel
COPY . ${LAMBDA_TASK_ROOT}
RUN pip3 install -r requirements.txt
CMD [ "lambda_function.handler" ]

In this Dockerfile, we start from the official AWS Lambda Python base image, install the system libraries required to build lxml, copy our application into the Lambda task root, install the Python requirements, and set the command to the handler function defined in lambda_function.py, which we'll create next.

Create a new file and save it as lambda_function.py. Enter the following in this file:

import subprocess


def handler(event, context):
    # Run the Scrapy spider as a subprocess; the feed export settings in
    # books.py take care of uploading the results to S3
    subprocess.run(["scrapy", "runspider", "books.py"])
    return {
        'statusCode': 200,  # a valid HTTP status code
        'body': 'Lambda function invoked',
    }
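Shelling out with subprocess keeps things simple, but you can also run the spider in-process through Scrapy's CrawlerProcess API. Here's a minimal sketch, assuming books.py sits next to lambda_function.py so the spider class can be imported:

from scrapy.crawler import CrawlerProcess

from books import BooksSpider  # assumes books.py is in the same directory


def handler(event, context):
    # Run the spider inside the Lambda process; the spider's custom_settings
    # still handle exporting the results to S3
    process = CrawlerProcess()
    process.crawl(BooksSpider)
    process.start()  # blocks until the crawl finishes
    return {
        'statusCode': 200,
        'body': 'Crawl finished',
    }

One caveat: Twisted's reactor can't be restarted inside a warm Lambda container, so the subprocess approach above is often the more forgiving option for repeated invocations.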

Finally, for deployment, we will need a YML file. Add a new file, save it as serverless.yml, and add the following code:

service: scrapy-lambda

provider:
  name: aws
  runtime: python3.10
  stage: dev
  region: us-east-1
  environment:
    BUCKET: YOUR-BUCKET-NAME
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "s3:*"
      Resource: "arn:aws:s3:::${self:provider.environment.BUCKET}/*"

functions:
  scrapyFunction:
    image: YOUR_REPO_URI:latest
    events:
      - http:
          path: scrape
          method: post
          cors: true

We will update YOUR_REPO_URI in the next section, and YOUR-BUCKET-NAME should match the S3 bucket name you use in the spider. Also, note that scrapy-lambda is just the name of the Docker image we will create in the next section.

Step 5: Deploying Docker image

The first step is to create a user using AWS IAM. Take note of the Access Key and Secret Access Key.

Execute the following and enter these keys when prompted:

$ aws configure

Next, create a new ECR repository by running the following:

$ aws ecr create-repository --repository-name YOUR_REPO_NAME

From the JSON output, take note of the repositoryUri value. It will look like 76890223446.dkr.ecr.us-east-1.amazonaws.com/scrapy-images. 

Replace the YOUR_REPO_URI in the serverless.yml file with this value.

If you haven't created an S3 bucket yet, go to the AWS console and create one now.

Take note of the bucket name and update the Scrapy spider code: replace YOUR-BUCKET-NAME in books.py with the actual S3 bucket name.
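If you prefer to create the bucket from Python instead of the console, a minimal boto3 sketch looks like this; the bucket name and region are placeholders.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Placeholder bucket name; S3 bucket names must be globally unique.
# For regions other than us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}
s3.create_bucket(Bucket="YOUR-BUCKET-NAME")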

Now, build your Docker image with the following command:

$ docker build -t scrapy-lambda .

Tag and push your Docker image to Amazon ECR using the following commands:

$ aws ecr get-login-password --region region | docker login --username AWS --password-stdin YOUR_REPO_URI

$ docker tag scrapy-lambda:latest YOUR_REPO_URI:latest
$ docker push YOUR_REPO_URI:latest

Replace region with your AWS region and YOUR_REPO_URI with the URI of your Amazon ECR repository.

Finally, deploy the service using the following command:

$ sls deploy

The output of this command should be as follows:

Service deployed to stack scrapy-lambda-dev (79s)

endpoint: POST - https://abcde.execute-api.us-east-1.amazonaws.com/dev/scrape

Step 6: Running the Lambda function 

When you run the sls deploy command, you will see the service URL endpoint in the command output.

To execute this function, send a POST request to this URL as follows:

curl -X POST https://abcde.execute-api.us-east-1.amazonaws.com/dev/scrape

The Lambda function execution will begin, and the output will be stored in the S3 bucket.
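If you'd rather trigger the function and download the results from Python instead of curl, a short script along these lines will do it; the endpoint URL, bucket name, and object key below are placeholders taken from the earlier steps.

import boto3
import requests

# Placeholders: use the endpoint printed by `sls deploy` and your own bucket name
ENDPOINT = "https://abcde.execute-api.us-east-1.amazonaws.com/dev/scrape"
BUCKET = "YOUR-BUCKET-NAME"
KEY = "items.json"

# Invoke the Lambda function through its API Gateway endpoint
response = requests.post(ENDPOINT, timeout=60)
print(response.status_code, response.text)

# Download the scraped data that the spider uploaded to S3
s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, "items.json")
print("Saved scraped data to items.json")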


Conclusion

This article covered the topic of serverless web scraping and how to run Scrapy as a Lambda function. We discussed the prerequisites, setting up a Scrapy crawler, configuring AWS Lambda for serverless scraping, and storing the scraped data in an AWS S3 bucket. With this knowledge, you should be able to harness the power of serverless web scraping, helping you perform more efficient and cost-effective data collection.

If you’re looking for similar content, check out our Scrapy Splash tutorial and Puppeteer on AWS Lambda article, which covers the main challenges of getting Puppeteer to work properly on AWS Lambda. As always, we’re ready to answer any questions you have via the live chat or at hello@oxylabs.io.

About the author

Yelyzaveta Nechytailo

Senior Content Manager

Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
