Craigslist, one of the biggest online classified ad platforms, is a valuable source of business information about goods and services. Before buying or selling anything, you may want to scrape relevant listings data from Craigslist for a detailed market analysis.
Manually gathering information from Craigslist can take a while, and web scraping can help. This step-by-step tutorial uses Python and Oxylabs Craigslist API to show how Craigslist data can be scraped without being blocked, including listing information such as prices, titles, and descriptions.
Craigslist has a wide variety of listings in many different categories, including real estate, employment, and services. Scraping Craigslist may provide useful information for a variety of purposes:
Market Research: Analyze price patterns and market demand for certain goods or services.
Lead Generation: Gather contact details of prospective customers or clients, such as phone numbers and email addresses.
Competitive Analysis: Analyze your competitors' listings and pricing techniques.
Data Gathering: Gather information for analysis, research, or building your own Craigslist-based applications.
However, it's important to remember that scraping Craigslist can be difficult due to the security measures in place to prevent automated access, such as IP blocking and CAPTCHA challenges. If you try scraping Craigslist using ordinary techniques, you may encounter the common error: “This IP has been automatically blocked. If you have questions, please email…”.
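If you want your own scripts to notice this block page instead of silently parsing it, a simple text check can help. The sketch below is only an illustration: the helper name looks_blocked is hypothetical, and it assumes the block notice appears in the page body with the wording quoted above.

```python
# Opening words of the block notice quoted above.
BLOCK_MARKER = "This IP has been automatically blocked"


def looks_blocked(html: str) -> bool:
    """Return True if the page body contains the block notice."""
    return BLOCK_MARKER.lower() in html.lower()


# Made-up page bodies, used only to demonstrate the check.
sample_blocked = "<html><body>This IP has been automatically blocked.</body></html>"
sample_ok = "<html><body><li class='cl-search-result'>A listing</li></body></html>"

print(looks_blocked(sample_blocked))
print(looks_blocked(sample_ok))
```

A check like this only detects the problem; it does not prevent it.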
This is where Oxylabs Craigslist Scraper comes in. It bypasses Craigslist CAPTCHAs and shields users from IP blocks, so let’s get started!
Here is a step-by-step tutorial on scraping Craigslist listings using Oxylabs’ Craigslist API:
Before we start scraping, make sure you have the following libraries installed:
pip install requests beautifulsoup4 pandas
You need access to the Oxylabs API to scrape Craigslist data without encountering IP blocks. Get API credentials by creating an Oxylabs account.
Head over to the Users tab in the Oxylabs dashboard and create an API user.
After getting the credentials for your API, you can start writing your code in Python.
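Rather than typing the API user's credentials directly into the script, you may prefer to read them from environment variables. The following is a minimal sketch under that assumption; the variable names OXYLABS_USERNAME and OXYLABS_PASSWORD and the helper load_credentials are hypothetical and not part of the Oxylabs API.

```python
import os


def load_credentials(env=os.environ):
    """Read the API user's credentials from a mapping of environment variables.

    OXYLABS_USERNAME and OXYLABS_PASSWORD are hypothetical names chosen
    for this sketch; use whatever names fit your setup.
    """
    username = env.get("OXYLABS_USERNAME")
    password = env.get("OXYLABS_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set OXYLABS_USERNAME and OXYLABS_PASSWORD first")
    return username, password


# Demonstration with a plain dict instead of the real environment.
demo_env = {"OXYLABS_USERNAME": "my_user", "OXYLABS_PASSWORD": "my_pass"}
auth = load_credentials(demo_env)
print(auth[0])
```

The returned tuple can be passed as the auth argument of the request shown later in this tutorial.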
1. Create a new Python code file and import the following libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
2. Add code to create a payload for the Web Scraper API like this:
payload = {
'source': 'universal',
'url': 'https://newyork.craigslist.org/search/bka#search=1~gallery~0~1',
'render': 'html'
}
Craigslist API is a part of the Web Scraper API by Oxylabs. To get the Craigslist data, you need to set the source as universal. Setting the url to the target Craigslist page and passing the html value to the render parameter instructs the API to return the HTML content of the URL provided. We can then scrape the required data from the HTML content.
3. Send the request and store the response in a variable.
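If you plan to scrape more than one search page, building the payload in a small function keeps the three parameters in one place. This is a sketch, not an official helper: build_payload is a hypothetical name, and it assumes other cities and categories follow the same URL pattern as the New York example above.

```python
def build_payload(city: str, category: str) -> dict:
    """Build a Web Scraper API payload for a Craigslist search page.

    Assumption: other cities and categories follow the same URL pattern
    as the tutorial's example (city "newyork", category code "bka").
    """
    url = f"https://{city}.craigslist.org/search/{category}#search=1~gallery~0~1"
    return {
        "source": "universal",  # generic source used for Craigslist pages
        "url": url,             # the target Craigslist page
        "render": "html",       # ask the API to return the page HTML
    }


payload = build_payload("newyork", "bka")
print(payload["url"])
```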
response = requests.request(
'POST',
'https://realtime.oxylabs.io/v1/queries',
auth=('<your_username>', '<your_password>'),
json=payload,
)
4. After receiving the response, you can get the required HTML content by parsing the response body as JSON. The HTML is in the content key of the first item in the results list.
result = response.json()['results']
htmlContent = result[0]['content']
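The two lines above assume the response always contains at least one result with a content key. A slightly more defensive version might look like the sketch below, which works on the already-parsed JSON. The helper name extract_html is hypothetical, and the sample dictionary is a made-up stand-in for a real API response.

```python
def extract_html(response_json: dict) -> str:
    """Return the HTML stored under results[0]['content'], or raise a clear error."""
    results = response_json.get("results") or []
    if not results:
        raise ValueError("API response contains no results")
    content = results[0].get("content")
    if not content:
        raise ValueError("First result has no 'content' field")
    return content


# Made-up stand-in for response.json(), used only for demonstration.
fake_response = {"results": [{"content": "<html><body>listing page</body></html>"}]}
htmlContent = extract_html(fake_response)
print(htmlContent)
```

In the tutorial's script, you would call extract_html(response.json()) instead of indexing the response directly.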
5. The HTML content can be further parsed using any HTML parsing tool to extract the desired information. Let’s use BeautifulSoup to extract the product title, price, and description for demonstration. But first, you need to have a look at the HTML source of these elements on the page.
The description of the product is on the product details page. Let’s now parse the retrieved HTML and extract these elements using BeautifulSoup.
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(htmlContent, 'html.parser')
# Extract prices, titles, and descriptions from Craigslist listings
listings = soup.find_all('li', class_='cl-search-result cl-search-view-mode-gallery')
df = pd.DataFrame(columns=["Product Title", "Description", "Price"])
for listing in listings:
    # Extract price (some listings have none)
    p = listing.find('span', class_='priceinfo')
    if p:
        price = p.text
    else:
        price = ""
    # Extract title and the link to the details page
    link = listing.find('a', class_='cl-app-anchor text-only posting-title')
    title = link.text
    url = link.get('href')
    # Fetch the details page and extract the description
    detailResp = requests.get(url).text
    detailSoup = BeautifulSoup(detailResp, 'html.parser')
    description_element = detailSoup.find('section', id='postingbody')
    if description_element:
        description = ''.join(description_element.find_all(string=True, recursive=False))
    else:
        description = ""
    # Add the listing as a new row in the dataframe
    df = pd.concat(
        [pd.DataFrame([[title, description.strip(), price]], columns=df.columns), df],
        ignore_index=True,
    )
The above code extracts the product list and then loops through each product to get its title, price, and description. All this data is then saved in a dataframe.
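To see the description step in isolation, the sketch below runs the same BeautifulSoup lookup on a small hand-written HTML fragment. The fragment is invented for illustration and only mimics the postingbody section; real Craigslist markup may differ. The helper name extract_description is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_description(detail_html: str) -> str:
    """Return the text placed directly inside <section id="postingbody">."""
    soup = BeautifulSoup(detail_html, "html.parser")
    body = soup.find("section", id="postingbody")
    if body is None:
        # No description section found on the page.
        return ""
    # Keep only text nodes that are direct children of the section,
    # which skips text inside nested elements.
    parts = body.find_all(string=True, recursive=False)
    return "".join(parts).strip()


# Hand-written fragment, not real Craigslist output.
sample_html = (
    '<section id="postingbody">'
    '<div class="notice">Nested text that is skipped</div>'
    "\nGently used paperback, pickup only.\n"
    "</section>"
)

print(extract_description(sample_html))
print(repr(extract_description("<p>No description here</p>")))
```

Returning an empty string when the section is missing keeps a single unusual page from stopping the whole loop.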
6. Save this dataframe to CSV and JSON files using the following lines of code:
df.to_csv("craiglist_results.csv", index=False)
df.to_json("craiglist_results.json", orient="split", index=False)
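With orient="split" and index=False, pandas writes the JSON file as an object with a columns list and a data list of rows. If another program needs one dictionary per listing, the file can be converted with the standard library alone. The sketch below uses a hand-written sample string in that layout rather than real scraped data, and split_json_to_records is a hypothetical helper name.

```python
import json


def split_json_to_records(text: str) -> list:
    """Convert pandas 'split'-style JSON into a list of row dictionaries."""
    doc = json.loads(text)
    columns = doc["columns"]
    return [dict(zip(columns, row)) for row in doc["data"]]


# Hand-written sample in the same layout, not real scraped output.
sample = (
    '{"columns":["Product Title","Description","Price"],'
    '"data":[["Used novel","Good condition","$5"]]}'
)

records = split_json_to_records(sample)
print(records[0]["Price"])
```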
Here is what the complete code looks like:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Structure payload.
payload = {
"source": "universal",
"url": "https://newyork.craigslist.org/search/bka#search=1~gallery~0~1",
"render": "html",
}
# Get response.
response = requests.request(
"POST",
"https://realtime.oxylabs.io/v1/queries",
auth=("<your_username>", "<your_password>"),
json=payload,
)
# JSON response with the result.
result = response.json()["results"]
htmlContent = result[0]["content"]
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(htmlContent, "html.parser")
# Extract prices, titles, and descriptions from Craigslist listings
listings = soup.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")
df = pd.DataFrame(columns=["Product Title", "Description", "Price"])
pd.set_option("max_colwidth", 1000)
for listing in listings:
    # Extract price (some listings have none)
    p = listing.find("span", class_="priceinfo")
    if p:
        price = p.text
    else:
        price = ""
    # Extract title and the link to the details page
    link = listing.find("a", class_="cl-app-anchor text-only posting-title")
    title = link.text
    url = link.get("href")
    # Fetch the details page and extract the description
    detailResp = requests.get(url).text
    detailSoup = BeautifulSoup(detailResp, "html.parser")
    description_element = detailSoup.find("section", id="postingbody")
    if description_element:
        description = "".join(description_element.find_all(string=True, recursive=False))
    else:
        description = ""
    # Add the listing as a new row in the dataframe
    df = pd.concat(
        [pd.DataFrame([[title, description.strip(), price]], columns=df.columns), df],
        ignore_index=True,
    )
# Copy the data to CSV and JSON files
df.to_csv("craiglist_results.csv", index=False)
df.to_json("craiglist_results.json", orient="split", index=False)
Here is a snippet of the output CSV file:
That’s it! You just scraped Craigslist listings without getting an IP block.
Scraping Craigslist data is useful for market analysis, lead generation, competitor analysis, and dataset acquisition. However, Craigslist’s strong security measures (e.g., IP blocks and CAPTCHA screens) make scraping tasks extremely difficult.
This is where Oxylabs’ Craigslist API comes in. It helps scrape Craigslist’s public listings at scale without IP blocks or CAPTCHA issues.
Lastly, proxies are essential for block-free web scraping. To resemble organic traffic, you can buy proxy solutions, most notably residential and datacenter IPs.
Consider using Oxylabs' IP rotation and proxy services to avoid triggering CAPTCHA on Craigslist. It will be more difficult for Craigslist to identify automated scraping if IP addresses are constantly changing.
Follow these tips to prevent detection while scraping Craigslist:
Rotate between multiple IP addresses.
Add random delays between requests to mimic human behavior.
Rotate user agent strings to simulate different web browsers.
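The last two tips can be sketched with the standard library as follows. This is an illustration only: the user agent strings are examples, the delay bounds are arbitrary, and the helper names random_headers and polite_pause are hypothetical.

```python
import random
import time

# Example user agent strings for illustration; replace with current ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers() -> dict:
    """Pick a user agent at random for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_pause(min_s: float = 2.0, max_s: float = 6.0, sleep=time.sleep) -> float:
    """Sleep for a random duration between min_s and max_s seconds."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


headers = random_headers()
waited = polite_pause(0.01, 0.02)
print(headers["User-Agent"])
```

The headers dictionary can be passed to requests.get(url, headers=headers), with polite_pause() called between consecutive requests.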
Scraping publicly available information is legal. However, rules and regulations may differ on a case-by-case basis, so we highly recommend seeking professional legal counsel before starting any Craigslist scraping project. To learn more, read our article Is Web Scraping Legal?
Yes, public data from Craigslist can be scraped. However, a tool such as a Craigslist ad API is highly recommended to streamline your scraping jobs when gathering ads data.
About the author
Danielius Radavicius
Former Copywriter
Danielius Radavičius was a Copywriter at Oxylabs. Having grown up in films, music, and books and having a keen interest in the defense industry, he decided to move his career toward tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.