Want to check where you’ve seen that actor before? Or perhaps want to rate a movie you really enjoyed watching? We all know the first place you’ll look. Or at least the first place Google will take you.
This blog post provides insights into IMDb, its relevance for scraping, and offers a guide on efficient movie data extraction.
As one of the best-known entertainment data repositories, IMDb contains tons of data on movies, TV shows, and even video games. Not only is there a lot of data, but it's also extremely varied. For example, you can explore movie descriptions, cast, ratings, trivia, related movies, awards, and more. In addition to that, you’ll find user-generated data, such as reviews.
This wealth of information can be applied for a number of purposes, ranging from market research and movie recommender systems to strategic marketing initiatives. Furthermore, user reviews are a goldmine for sentiment analysis, which can deepen insights into movie audiences.
As you'll be writing a Python script, make sure you have Python 3.8 or newer installed on your machine.
A virtual environment is an isolated space where you can install libraries and dependencies without affecting your global Python setup. It's a good practice to create one for each project. Here's how to set it up on different operating systems:
python -m venv imdb_env #Windows
python3 -m venv imdb_env #Mac and Linux
Replace imdb_env with the name you'd like to give to your virtual environment.
Once the virtual environment is created, you'll need to activate it:
.\imdb_env\Scripts\Activate #Windows
source imdb_env/bin/activate #Mac and Linux
You should see the name of your virtual environment in the terminal, indicating that it's active.
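If the terminal prompt leaves any doubt, a quick check from Python itself can confirm both requirements. The following is a minimal sketch that relies on the standard library only; the function names are illustrative and not part of any tool used in this guide.

```python
import sys


def python_is_supported(minimum=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def in_virtual_environment():
    """Return True when running inside a venv-style virtual environment.

    Inside a venv, sys.prefix points at the environment, while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print("Python 3.8+:", python_is_supported())
print("Virtual environment active:", in_virtual_environment())
```

Run it with the same interpreter you plan to use for the scraper; if the second line prints False, the environment is most likely not activated.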
We'll use the requests library to make HTTP requests and pandas to save results as CSV. Install both by running the following command:
$ pip install requests pandas
And there you have it! Your project environment is ready for IMDb data scraping. In the following sections, we'll delve deeper into the IMDb structure.
Oxylabs' Web Scraper API allows you to extract data from many complex websites easily. The following is a basic example that shows how Scraper API works.
# scraper_api_demo.py
import requests

USERNAME = "username"
PASSWORD = "password"

payload = {
    "source": "universal",
    "url": "https://www.imdb.com",
}

response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=(USERNAME, PASSWORD),
)
print(response.json())
After importing requests, you need to replace the credentials with your own, which you can get by registering for a Web Scraper API subscription or getting a free trial. The payload is where you inform the API what and how you want to scrape.
Save this code in a file scraper_api_demo.py and run it. You’ll see that the entire HTML of the page will be printed, along with some additional information from Scraper API.
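Printing the whole response can be noisy. As later sections of this guide show, the page data sits under results[0]['content'], so a small helper can pull out just that part. The sketch below uses a made-up, heavily shortened sample response so that it runs without credentials; the helper name extract_content is an assumption for illustration.

```python
def extract_content(response_json):
    """Return the 'content' of the first result, or None if it is missing."""
    results = response_json.get("results") or []
    if not results:
        return None
    return results[0].get("content")


# Hypothetical sample used only for illustration; a real response is larger.
sample_response = {
    "results": [
        {"content": "<html><head><title>IMDb</title></head></html>"}
    ]
}

html = extract_content(sample_response)
print(html[:40])
```

In the demo script, you would call extract_content(response.json()) instead of passing the sample.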
In the following section, let's examine various parameters we can send in the payload.
The most critical parameter is source. For IMDb, set the source as universal, which is general-purpose and can handle most domains.
The parameter url is self-explanatory, a direct link to the IMDb URLs you want to scrape. In the code discussed in the previous section, there are only two parameters. As a result, you get the entire HTML of the page.
Instead, what you need is parsed data. This is where the parameter parse comes into the picture. When you send parse as True, you must also send one more parameter — parsing_instructions. Combined, these two allow you to get parsed data in a structure you prefer.
The following allows you to get a JSON of the page title:
{
    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//title/text()"]
            }
        ]
    }
}
If you send this as parsing_instructions, the output would be the following JSON:
{'title': 'IMDb Top 250 Movies'}
The key _fns indicates a list of functions, which can contain one or more functions indicated by the _fn key, along with the arguments.
In this example, the function is xpath_one, which takes an XPath and returns the first matching element. On the other hand, the function xpath returns all matching elements.
The functions css_one and css are similar but use CSS selectors instead of XPath. For a complete list of available functions, see the Scraper API documentation.
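To get a feel for the difference between "first match" and "all matches", here is a local analogy using Python's standard library. It does not call Scraper API, and ElementTree supports only a subset of XPath, so treat it as an illustration of the idea rather than of the API itself. The document below is made up.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document standing in for a real page.
doc = ET.fromstring(
    "<ul>"
    "<li>The Shawshank Redemption</li>"
    "<li>The Godfather</li>"
    "<li>The Dark Knight</li>"
    "</ul>"
)

first_match = doc.find(".//li")     # comparable to xpath_one: first match only
all_matches = doc.findall(".//li")  # comparable to xpath: every match

print(first_match.text)
print([li.text for li in all_matches])
```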
The following code prints the title of the IMDb page:
# imdb_title.py
import requests

USERNAME = "username"
PASSWORD = "password"

payload = {
    "source": "universal",
    "url": "https://www.imdb.com",
    "parse": True,
    "parsing_instructions": {
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//title/text()"],
                }
            ]
        }
    },
}

response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=(USERNAME, PASSWORD),
)
print(response.json()["results"][0]["content"])
Run this file to get the title of the IMDb page. In the next section, you’ll scrape movie data from a list.
Before scraping a page, we need to examine the page structure. Open the IMDb top 250 listing in Chrome, right-click the movie list, and select Inspect.
Move your mouse around until you can precisely select one movie list item and its related data.
Inspecting an element
You can use the following XPath to select one movie detail:
//li[contains(@class,'ipc-metadata-list-summary-item')]
Also, you can iterate over these 250 items and get movie titles, year, and ratings using the same selector. Let’s see how to do it.
First, create the placeholder for movies as follows:
"movies": {
    "_fns": [
        {
            "_fn": "xpath",
            "_args": [
                "//li[contains(@class,'ipc-metadata-list-summary-item')]"
            ]
        }
    ]
}
Note the use of the function xpath. It means that it will return all matching elements.
Next, we can use the reserved property _items to indicate that we want to iterate over a list, further processing each list item separately.
Inside _items, each XPath is relative to the current list item, so it extends the path already defined, as follows:
import json

payload = {
    "source": "universal",
    "url": "https://www.imdb.com/chart/top/?ref_=nv_mv_250",
    "parse": True,
    "parsing_instructions": {
        "movies": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//li[contains(@class,'ipc-metadata-list-summary-item')]"],
                }
            ],
            "_items": {
                "movie_name": {
                    "_fns": [{"_fn": "xpath_one", "_args": [".//h3/text()"]}]
                },
                "year": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [".//*[contains(@class,'cli-title-metadata-item')]/text()"],
                        }
                    ]
                },
                "rating": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [".//*[contains(@aria-label,'IMDb rating')]/text()"],
                        }
                    ]
                },
            },
        }
    },
}

with open("top_250_payload.json", "w") as f:
    json.dump(payload, f, indent=4)
Note the use of .// at the start of the XPath for movie_name, year, and rating: it makes each expression relative to the current list item. A good way to organize your code is to save the payload as a separate JSON file. It will allow you to keep your Python file short:
# parse_top_250.py
import json

import requests

USERNAME = "username"
PASSWORD = "password"

with open("top_250_payload.json") as f:
    payload = json.load(f)

response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=(USERNAME, PASSWORD),
)
print(response.status_code)

with open("result.json", "w") as f:
    json.dump(response.json(), f, indent=4)
Code and output
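The extracted values arrive as text, so you may want to convert the year and rating to numbers before analysis. The following is a minimal sketch, assuming the saved result.json holds the parsed data under results[0]['content']['movies'] with the keys defined in the payload. The sample data is made up, and the helper names are illustrative.

```python
def to_number(value, cast):
    """Convert a scraped string with cast(); return None if that fails."""
    try:
        return cast(str(value).strip())
    except (TypeError, ValueError):
        return None


def clean_movies(data):
    """Return the parsed movies with numeric year and rating fields."""
    movies = data["results"][0]["content"]["movies"]
    return [
        {
            "movie_name": movie.get("movie_name"),
            "year": to_number(movie.get("year"), int),
            "rating": to_number(movie.get("rating"), float),
        }
        for movie in movies
    ]


# Made-up sample shaped like the saved result.json; real values will differ.
sample = {
    "results": [
        {
            "content": {
                "movies": [
                    {"movie_name": "Example Movie", "year": "1994", "rating": "9.3"},
                    {"movie_name": "Another Movie", "year": None, "rating": "n/a"},
                ]
            }
        }
    ]
}

for movie in clean_movies(sample):
    print(movie)
```

Values that cannot be converted become None instead of raising an error, which keeps one malformed item from stopping the whole run.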
That’s how it’s done! In the next section, you’ll explore how to scrape movie reviews from IMDb.
Let's scrape the reviews of The Shawshank Redemption. You'll use CSS selectors instead of XPath this time, but the basic idea remains the same: use the css function to create a reviews node, then use _items to extract information about each review.
First, take a look at the selectors:
The container for each review can be selected using .imdb-user-review. After that, we can use the following selectors to get various metadata:
.title for selecting the review title
.display-name-link a for reviewer name
.review-date for the review date
.content>.show-more__control for the review body
CSS selectors, unlike XPath, cannot directly match the text in an element. This is where one more function from Scraper API becomes useful: element_text.
The element_text function extracts the text in the element. Scraper API allows us to chain as many functions as needed. It means we can chain css_one and element_text functions to select the data we need.
"reviews": {
    "_fns": [
        {"_fn": "css", "_args": [".imdb-user-review"]}
    ],
    "_items": {
        "review_title": {
            "_fns": [
                {"_fn": "css_one", "_args": [".title"]},
                {"_fn": "element_text"}
            ]
        }
    }
}
Similarly, you can extract other data points. Here's how the complete payload should look, with the movie name added as a top-level field:
{
    "source": "universal",
    "url": "https://www.imdb.com/title/tt0111161/reviews?ref_=tt_urv",
    "parse": true,
    "parsing_instructions": {
        "movie_name": {
            "_fns": [
                {"_fn": "css_one", "_args": [".parent a"]},
                {"_fn": "element_text"}
            ]
        },
        "reviews": {
            "_fns": [
                {"_fn": "css", "_args": [".imdb-user-review"]}
            ],
            "_items": {
                "review_title": {
                    "_fns": [
                        {"_fn": "css_one", "_args": [".title"]},
                        {"_fn": "element_text"}
                    ]
                },
                "review-body": {
                    "_fns": [
                        {"_fn": "css_one", "_args": [".content>.show-more__control"]},
                        {"_fn": "element_text"}
                    ]
                },
                "rating": {
                    "_fns": [
                        {"_fn": "css_one", "_args": [".rating-other-user-rating"]},
                        {"_fn": "element_text"}
                    ]
                },
                "name": {
                    "_fns": [
                        {"_fn": "css_one", "_args": [".display-name-link a"]},
                        {"_fn": "element_text"}
                    ]
                },
                "review_date": {
                    "_fns": [
                        {"_fn": "css_one", "_args": [".review-date"]},
                        {"_fn": "element_text"}
                    ]
                }
            }
        }
    }
}
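Since every field repeats the same css_one plus element_text chain, you could also generate the payload in Python rather than writing the JSON by hand. The following is a sketch of that idea; the helper css_text and the review_fields mapping are illustrative names, while the keys and selectors are the ones from the payload above.

```python
import json


def css_text(selector):
    """Build a field that selects one element by CSS and extracts its text."""
    return {
        "_fns": [
            {"_fn": "css_one", "_args": [selector]},
            {"_fn": "element_text"},
        ]
    }


# Field names and selectors taken from the payload above.
review_fields = {
    "review_title": ".title",
    "review-body": ".content>.show-more__control",
    "rating": ".rating-other-user-rating",
    "name": ".display-name-link a",
    "review_date": ".review-date",
}

payload = {
    "source": "universal",
    "url": "https://www.imdb.com/title/tt0111161/reviews?ref_=tt_urv",
    "parse": True,
    "parsing_instructions": {
        "movie_name": css_text(".parent a"),
        "reviews": {
            "_fns": [{"_fn": "css", "_args": [".imdb-user-review"]}],
            "_items": {
                name: css_text(selector)
                for name, selector in review_fields.items()
            },
        },
    },
}

print(json.dumps(payload, indent=4))
```

Adding a data point then takes one new entry in review_fields, and json.dump can write the result to a payload file of your choice.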
Once your payload file is ready, you can use the same Python code file shown in the previous section, point to this payload, and run the code to get the results.
Comparing payload and results
The output of Scraper API is a JSON and you can save the extracted data as JSON directly. If you want a CSV file, you can use a library such as Pandas. Remember that the parsed data is stored in the content inside results.
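As the reviews are stored under the key reviews, we can use the following snippet to save the extracted data:
# parse_reviews.py
import json

import pandas as pd
import requests

USERNAME = "username"
PASSWORD = "password"

# load the reviews payload (the file name is an assumption; use your own)
with open("reviews_payload.json") as f:
    payload = json.load(f)

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries", json=payload, auth=(USERNAME, PASSWORD)
)
# save results into a variable data
data = response.json()
# save the data as a json file
with open("results_reviews.json", "w") as f:
    json.dump(data, f, indent=4)
# save the reviews in a CSV file
df = pd.DataFrame(data["results"][0]["content"]["reviews"])
df.to_csv("reviews.csv", index=False)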
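If you would rather not depend on pandas, or want each row to carry the movie name as well, the standard csv module can do the job. The following is a minimal sketch under the same assumption about the response shape (results[0]['content']); the sample data is made up, and the function names are illustrative.

```python
import csv
import io


def reviews_to_rows(data):
    """Flatten parsed content into one row per review, tagged with the movie name."""
    content = data["results"][0]["content"]
    movie_name = content.get("movie_name")
    return [
        {"movie_name": movie_name, **review}
        for review in content.get("reviews", [])
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text, using the union of keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample mirroring the payload's keys; real output will differ.
sample = {
    "results": [
        {
            "content": {
                "movie_name": "Example Movie",
                "reviews": [
                    {"review_title": "Great", "rating": "10"},
                    {"review_title": "Fine"},
                ],
            }
        }
    ]
}

rows = reviews_to_rows(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Missing fields are written as empty cells, so reviews without a rating still produce a valid row.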
Web Scraper API simplifies web scraping by taking care of the most common data gathering challenges, including managing proxies. You can use any language you like, and all you need to do is send the correct payload. If you're looking to scale your efforts, you can buy proxies to ensure efficient and uninterrupted scraping.
You might also be interested in reading up about scraping other targets such as YouTube, Google News, or Netflix.
While web scraping publicly available data from IMDb is considered to be legal, it highly depends on such factors as the target, local legislation, and how the data is going to be used. We highly recommend that you seek professional legal advice before starting any operations.
To learn more about the legality of web scraping, check here.
There are a couple of ways that you can scrape movie data. You can either build a custom scraper or buy a commercial one. While a custom one will be more flexible, you’ll have to dedicate lots of resources to bypassing anti-bot systems and parsing the data. On the other hand, a commercial solution takes care of these aspects for you.
To scrape a movie review on IMDb, you'll need to use a programming language like Python along with libraries like requests and Beautiful Soup. Alternatively, you can use Python alongside a Web Scraper API.
About the author
Enrika Pavlovskytė
Former Copywriter
Enrika Pavlovskytė was a Copywriter at Oxylabs. With a background in digital heritage research, she became increasingly fascinated with innovative technologies and started transitioning into the tech world. On her days off, you might find her camping in the wilderness and, perhaps, trying to befriend a fox! Even so, she would never pass up a chance to binge-watch old horror movies on the couch.