Serverless Web Scraping with Scrapy and AWS Lambda
Yelyzaveta Nechytailo
In today's data-driven world, effective web scraping is crucial. However, managing a fleet of servers to perform scraping operations isn't always the most convenient or cost-effective option for engineers and developers. This is where serverless scraping frameworks come into play. This article will explore using Scrapy, a popular web scraping framework, with AWS Lambda, a serverless computing platform from Amazon Web Services. By the end, you'll have an AWS Lambda web scraper ready for deployment.
Serverless web scraping combines a serverless computing platform, such as AWS Lambda, with a web crawling framework, such as Scrapy, to extract data from the web efficiently. By integrating Scrapy with AWS Lambda, developers can build robust, scalable, and cost-effective web scraping pipelines without managing any servers or paying for idle time.
What is AWS Lambda?
AWS Lambda is a serverless computing service provided by Amazon Web Services. It allows developers to run their code without managing servers, automatically scaling to handle any workload. In simpler terms, it takes care of all the infrastructure needed to run your code on a highly available, distributed compute platform and allocates the right resources for it automatically.
The serverless aspect ensures optimal allocation of resources, allowing the application to scale up during high demand and scale down during quieter periods. This elasticity makes serverless web scraping a smart choice for projects with unpredictable load or where high-scale operations are needed sporadically. Alternatively, you may want to consider setting up Scrapy Cloud for distributed scraping.
On the other hand, using a powerful web crawling framework like Scrapy offers comprehensive tools to scrape websites effectively and with greater control. This framework enables handling complex data extraction and storing the scraped data in the desired format.
What is Scrapy?
Scrapy is an open-source and collaborative web crawling framework written in Python. It's designed to handle a wide range of tasks, from crawling websites and extracting data to processing the scraped results. With its built-in functionality for extracting and storing information in your preferred structure and format, Scrapy stands out as a robust framework for web scraping.
To perform web scraping with Scrapy effectively, we recommend integrating it with Oxylabs' Residential or Datacenter Proxies. The biggest advantage of using proxies with Scrapy is that they hide your actual IP address from the target site's server while scraping. This protects your privacy and reduces the chance of being banned by target sites when you're using an automated tool rather than manually copying and pasting data.
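To illustrate, here's a minimal sketch of how a proxy can be attached to Scrapy requests via request.meta, which Scrapy's built-in HttpProxyMiddleware picks up automatically. The USERNAME, PASSWORD, and pr.oxylabs.io:7777 endpoint below are placeholders; use the credentials and hostname from your own proxy dashboard:
# proxy_example.py - a minimal sketch of routing Scrapy requests through a proxy
import scrapy

class ProxiedSpider(scrapy.Spider):
    name = "proxied_example"
    start_urls = ["https://sandbox.oxylabs.io/products"]

    def start_requests(self):
        # Placeholder credentials and endpoint - replace with your own.
        proxy = "http://USERNAME:PASSWORD@pr.oxylabs.io:7777"
        for url in self.start_urls:
            # HttpProxyMiddleware routes the request through the proxy set in meta.
            yield scrapy.Request(url, meta={"proxy": proxy})

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)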
Before getting into AWS Lambda scraping configuration, the first step towards serverless web scraping is to set up a Scrapy crawler. Scrapy uses spiders, which are self-contained crawlers that are given a set of instructions.
First, install Scrapy for your environment using pip via the terminal:
pip install scrapy
For demonstration, let's use a simple spider. Create a new Python file named games.py and paste the following code, which scrapes products from a demo e-commerce site:
#====# games.py #====#
import scrapy


class GamesSpider(scrapy.Spider):
    name = "games"
    allowed_domains = ["sandbox.oxylabs.io"]
    start_urls = ["https://sandbox.oxylabs.io/products"]

    def parse(self, response):
        for card in response.css(".product-card"):
            yield {
                "title": card.css("h4::text").get(),
                "price": card.css(".price-wrapper::text").get(),
            }

        next_page = response.css("li.next > a::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
The sample spider will start at https://sandbox.oxylabs.io/products, extract the title and price of all the games, go to the next page, and repeat. The final result is scraped data from 3,000 game products.
If you run the spider on your machine, you can use the -o switch to redirect the output to a file. For example, the following command creates a games.json file without you having to write any export code:
scrapy runspider games.py -o games.json
When you run the Scrapy spider as an AWS Lambda function, you cannot access the terminal, and the local file system is ephemeral, so you can't simply write the output to a local file and retrieve it afterward. The alternative is to store the output in an S3 bucket. For this step, you need to have an AWS account ready.
Modify the spider code to add these custom settings:
class GamesSpider(scrapy.Spider):
    name = "games"
    allowed_domains = ["sandbox.oxylabs.io"]
    start_urls = ["https://sandbox.oxylabs.io/products"]

    custom_settings = {
        "FEED_URI": "s3://YOUR-BUCKET-NAME/items.json",
        "FEED_FORMAT": "json",
    }

    # rest of the code
Replace YOUR-BUCKET-NAME with the actual S3 bucket you created for your AWS account.
Before we can configure the local environment for AWS Lambda, install the following tools:
Docker
AWS CLI
Serverless Framework
boto3 package
Configuring the Lambda function requires creating a Docker image of your Scrapy spider and uploading it to the AWS platform. Docker allows you to package an application with its environment and dependencies into a container, which can be quickly shipped and run anywhere. This process ensures that the application will run the same, regardless of any customized settings or previously installed software on the AWS Lambda server that could differ from our local development environment.
If you don't have Docker installed already, you can download and install Docker Personal from docker.com. Ensure that you can run the Docker executable from the command line.
The AWS Command Line Interface (AWS CLI) is a powerful tool that lets you interact with Amazon Web Services from your operating system's terminal. It provides a convenient and efficient way to manage various AWS services and resources without a graphical user interface.
To install AWS CLI, visit https://aws.amazon.com/cli/ and download the package for your operating system.
You'll also need the Serverless Framework to deploy the containerized function to AWS Lambda. You can install it using npm. If you don't have Node.js installed yet, go to the official Node.js website: https://nodejs.org/, download the LTS version, and install it. After that, run the following command:
$ npm install -g serverless
Boto3 is the AWS SDK for Python, developed by Amazon Web Services. It is built on top of Botocore, a low-level library that provides the core functionality for interacting with AWS services from Python code.
Ensure that you have created and activated a virtual environment, then run the following:
(venv) $ pip install boto3
Note that this virtual environment should also have Scrapy installed. If you haven't installed it yet, do so using pip:
(venv) $ pip install scrapy
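Optionally, you can run a quick sanity check to confirm that boto3 can reach your AWS account. This is just a sketch and assumes your AWS credentials are already configured locally (the aws configure step is covered later in this tutorial):
# check_aws.py - optional check that boto3 can reach your AWS account
import boto3

# Uses whatever credentials are configured locally (e.g., via `aws configure`).
s3 = boto3.client("s3")
response = s3.list_buckets()
print([bucket["Name"] for bucket in response["Buckets"]])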
To keep the Docker container manageable and limited to the necessary dependencies, create a requirements.txt file that lists all the Python packages your Scrapy spider needs to run. This file might look like this:
requests==2.31.0
requests-file==1.5.1
Scrapy==2.6.0
boto3==1.28.14
service-identity==21.1.0
cryptography==38.0.4
# more packages
The next step is to create a Dockerfile. Create a new file, name it Dockerfile (no extension), and add the following contents:
FROM public.ecr.aws/lambda/python:3.10
# Required for lxml
RUN yum install -y gcc libxml2-devel libxslt-devel
COPY . ${LAMBDA_TASK_ROOT}
RUN pip3 install -r requirements.txt
CMD [ "lambda_function.handler" ]
In this Dockerfile, we start from the AWS Lambda Python 3.10 base image, install the system packages required by lxml, copy the application into the Lambda task root, install the Python requirements, and set lambda_function.handler as the function's entry point.
Create a new file and save it as lambda_function.py. Enter the following in this file:
import subprocess


def handler(event, context):
    # Run the Scrapy spider
    subprocess.run(["scrapy", "runspider", "games.py"])

    return {
        "statusCode": 200,  # a valid HTTP status code
        "body": "Lambda function invoked",
    }
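As an alternative to shelling out with subprocess, the spider can also be run in-process with Scrapy's CrawlerProcess API. The sketch below assumes GamesSpider is importable from games.py; keep in mind that Twisted's reactor cannot be restarted within the same process, so this pattern suits cold starts rather than warm, repeated invocations:
# lambda_function.py - alternative sketch using Scrapy's in-process API
from scrapy.crawler import CrawlerProcess

from games import GamesSpider  # assumes games.py sits next to this file


def handler(event, context):
    process = CrawlerProcess()
    process.crawl(GamesSpider)
    # Blocks until the crawl finishes; the Twisted reactor cannot be restarted,
    # so warm invocations would need a fresh process.
    process.start()
    return {
        "statusCode": 200,
        "body": "Crawl finished",
    }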
Finally, for deployment, you'll need a YML file. Add a new file, save it as serverless.yml, and add the following code:
service: scrapy-lambda

provider:
  name: aws
  runtime: python3.10
  stage: dev
  region: us-east-1
  environment:
    BUCKET: YOUR-BUCKET-NAME
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "s3:*"
      Resource: "arn:aws:s3:::${self:provider.environment.BUCKET}/*"

functions:
  scrapyFunction:
    image: YOUR_REPO_URI:latest
    events:
      - http:
          path: scrape
          method: post
          cors: true
We'll fill in YOUR_REPO_URI in the next section, and YOUR-BUCKET-NAME should be the S3 bucket that will store the output. Also, note that scrapy-lambda is just the name of the image we'll create in the next section.
The first step is to create an IAM user in the AWS console and generate access keys for it. Take note of the Access Key ID and Secret Access Key.
Execute the following and enter these access keys when prompted:
aws configure
Next, create a new ECR repository by running the following:
$ aws ecr create-repository --repository-name YOUR_REPO_NAME
From the JSON output, take note of the repositoryUri value. It will look like 76890223446.dkr.ecr.us-east-1.amazonaws.com/scrapy-images.
Replace the YOUR_REPO_URI in the serverless.yml file with this value.
If you haven't created an S3 bucket yet, create one in the AWS console.
Take note of the S3 bucket name and update the Scrapy spider code: replace YOUR-BUCKET-NAME in games.py (and in serverless.yml) with the actual bucket name.
Now, build your Docker image with the following command:
$ docker build -t scrapy-lambda .
Tag and push your Docker image to Amazon ECR using the following commands:
$ aws ecr get-login-password --region region | docker login --username AWS --password-stdin YOUR_REPO_URI
$ docker tag scrapy-lambda:latest YOUR_REPO_URI:latest
$ docker push YOUR_REPO_URI:latest
Replace region with your AWS region and YOUR_REPO_URI with your Amazon ECR repository URI.
Finally, deploy the images using the following command:
$ sls deploy
The output of this command should look as follows:
Service deployed to stack scrapy-lambda-dev (79s)
endpoint: POST - https://abcde.execute-api.us-east-1.amazonaws.com/dev/scrape
Take note of the service endpoint URL shown in the sls deploy output.
To execute this function, send a POST request to this URL as follows:
curl -X POST https://abcde.execute-api.us-east-1.amazonaws.com/dev/scrape
The Lambda function execution will begin, and the output will be stored in your S3 bucket.
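Once the crawl completes, you can pull the exported feed down with boto3. Here's a minimal sketch, assuming the feed was written to items.json in the bucket you configured earlier:
# download_results.py - fetch the exported feed from S3
import boto3

s3 = boto3.client("s3")
# YOUR-BUCKET-NAME is the bucket used in the spider's FEED_URI setting.
s3.download_file("YOUR-BUCKET-NAME", "items.json", "items.json")
print("Downloaded items.json")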
There are several methods you can use to set up and ease debugging of your Python scripts on AWS:
Deploying after each adjustment: After every integration step or change in your serverless configuration or scraper code, run sls deploy. This approach ensures that you catch errors early in the deployment process and validate that your changes work as expected. For faster iterations, you can perform local testing using sls invoke local.
Using logging: Set up logging with print statements or Python's logging module directly in your Python script (see the sketch after this list). The Amazon CloudWatch service will automatically stream these outputs to a log group named /aws/lambda/<function-name>, which you can then open and inspect. Note that CloudWatch may incur additional costs if the logging volume is high.
Invoking Lambda functions with logs: Once you have logging enabled in your Python code, use sls invoke -f <function-name> --log to run the function, fetch logs, and display them directly in your local terminal.
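As a sketch of the logging approach mentioned above, the Lambda handler from earlier could be adjusted as follows; the messages end up in the function's CloudWatch log group:
# lambda_function.py - logging sketch; output lands in CloudWatch under
# /aws/lambda/<function-name>
import logging
import subprocess

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    logger.info("Starting Scrapy spider")
    result = subprocess.run(["scrapy", "runspider", "games.py"])
    logger.info("Spider finished with return code %s", result.returncode)
    return {
        "statusCode": 200,
        "body": "Lambda function invoked",
    }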
This article covered the topic of serverless web scraping and how to run Scrapy as a Lambda function. We discussed the prerequisites, setting up a Scrapy crawler, configuring AWS Lambda for serverless scraping, and storing the scraped information in an AWS S3 bucket. With this knowledge, you should be able to harness the power of serverless web scraping, helping you perform more efficient and cost-effective data collection.
If you're looking for more similar content, check out our Scrapy Splash tutorial and Puppeteer on AWS Lambda, covering the main challenges of getting Puppeteer to work properly on AWS Lambda. As always, we're ready to answer any questions you have via the live chat or at hello@oxylabs.io.
About the author
Yelyzaveta Nechytailo
Senior Content Manager
Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.