Craigslist, one of the biggest online classified ad platforms, is a rich source of business information about a wide range of goods and services. Before buying or selling anything, you may want to scrape relevant listing data from Craigslist for a detailed market analysis.
After all, gathering information from Craigslist manually can take a while. Thankfully, web scraping can help. This step-by-step tutorial shows how to scrape Craigslist data without getting blocked, using Python and the Oxylabs Craigslist API. We'll also cover how to extract listing information, including prices, titles, and descriptions.
Craigslist has a wide variety of listings in many different categories, including real estate, employment, and services. Scraping Craigslist may provide useful information for a variety of purposes:
Market Research: Analyze price patterns and market demand for certain goods or services.
Lead Generation: Gather contact details, such as phone numbers and email addresses, from prospective customers or clients.
Competitive Analysis: Examine your competitors' listings and pricing strategies.
Data Gathering: Collect information for analysis, research, or building your own Craigslist-based applications.
However, it's important to remember that scraping Craigslist can be difficult due to the security measures in place to prevent automated access, such as IP blocking and CAPTCHA challenges. If you try scraping Craigslist with ordinary techniques, you may encounter the common error: “This IP has been automatically blocked. If you have questions, please email…”.
This is where the Oxylabs Craigslist Scraper comes in. It bypasses Craigslist CAPTCHAs and shields users from IP blocks, so let's get started!
Request a free trial to test our Web Scraper API for your use case.
Here is the step-by-step tutorial on scraping Craigslist listings using Oxylabs' Craigslist API:
Before we start scraping, make sure you have the following libraries installed:
pip install requests beautifulsoup4 pandas
You need access to the Oxylabs API to scrape Craigslist page data without encountering IP blocks. Get API credentials by creating an Oxylabs account.
Head over to the Users tab in the Oxylabs dashboard and create an API user.
After getting the credentials for your API, you can start writing your code in Python.
1. Create a new Python code file and import the following libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
2. Add code to create a payload for the Web Scraper API like this:
payload = {
    'source': 'universal',
    'url': 'https://newyork.craigslist.org/search/bka#search=1~gallery~0~1',
    'render': 'html'
}
Craigslist API is a part of the Web Scraper API by Oxylabs. To get Craigslist data, you need to set the source parameter to universal. Setting the url to the target Craigslist web page and passing the html value to the render parameter instructs the API to return the HTML content of the provided URL. We can then scrape the required data from that HTML.
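The universal source also accepts optional parameters. As a hedged sketch (geo_location and user_agent_type are Oxylabs Web Scraper API options, but verify the exact names and accepted values against the current API reference), you could localize the request like this:
payload = {
    'source': 'universal',
    'url': 'https://newyork.craigslist.org/search/bka#search=1~gallery~0~1',
    'render': 'html',
    # Optional extras -- confirm supported values in the Oxylabs docs:
    'geo_location': 'United States',  # serve the request from a US location
    'user_agent_type': 'desktop',     # present a desktop browser profile
}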
3. Initialize the request and store the response in a variable.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('<your_username>', '<your_password>'),
    json=payload,
)
4. After receiving the response, you can get the required HTML content by converting the response object into JSON format. The HTML is stored under the content key of the first result.
result = response.json()['results']
htmlContent = result[0]['content']
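Before indexing into the response, it's worth guarding against failed requests. A minimal sketch:
# Fail fast on HTTP errors and empty API results
response.raise_for_status()
results = response.json().get('results', [])
if not results:
    raise RuntimeError('The API returned no results for this URL')
htmlContent = results[0]['content']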
5. The HTML content can be further parsed using any HTML parsing tool to extract the desired information. For demonstration, let's use BeautifulSoup to extract the product title, price, and description. But first, take a look at the HTML source of these elements on the page.
The product description lives on the product details page. Let's now parse the retrieved HTML and extract these elements using BeautifulSoup.
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(htmlContent, 'html.parser')

# Extract prices, titles, and descriptions from Craigslist listings
listings = soup.find_all('li', class_='cl-search-result cl-search-view-mode-gallery')
df = pd.DataFrame(columns=["Product Title", "Description", "Price"])

for listing in listings:
    # Extract price (not every listing has one)
    p = listing.find('span', class_='priceinfo')
    price = p.text if p else ""

    # Extract the title and the link to the listing's details page
    anchor = listing.find('a', class_='cl-app-anchor text-only posting-title')
    title = anchor.text
    url = anchor.get('href')

    # Fetch the details page and extract the description text
    detailResp = requests.get(url).text
    detailSoup = BeautifulSoup(detailResp, 'html.parser')
    description_element = detailSoup.find('section', id='postingbody')
    description = ''.join(description_element.find_all(string=True, recursive=False))

    df = pd.concat(
        [pd.DataFrame([[title, description.strip(), price]], columns=df.columns), df],
        ignore_index=True,
    )
The above code extracts the list of products, then loops through each product to get its title, price, and description. All of this data is then saved in a pandas DataFrame.
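Before saving, you can sanity-check the collected data:
# Peek at the first few scraped listings
print(df.head())
print(f"Collected {len(df)} listings")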
6. Save this DataFrame to CSV and JSON files using the following lines of code:
df.to_csv("craigslist_results.csv", index=False)
df.to_json("craigslist_results.json", orient="split", index=False)
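To continue the analysis later, you can reload the saved CSV with pandas:
# Reload the saved results for further analysis
df = pd.read_csv("craigslist_results.csv")
print(df["Price"].head())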
Here is what the complete code looks like:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Structure payload.
payload = {
    "source": "universal",
    "url": "https://newyork.craigslist.org/search/bka#search=1~gallery~0~1",
    "render": "html",
}

# Get response.
response = requests.request(
    "POST",
    "https://realtime.oxylabs.io/v1/queries",
    auth=("<your_username>", "<your_password>"),
    json=payload,
)

# JSON response with the result.
result = response.json()["results"]
htmlContent = result[0]["content"]

# Parse the HTML content using BeautifulSoup.
soup = BeautifulSoup(htmlContent, "html.parser")

# Extract prices, titles, and descriptions from Craigslist listings.
listings = soup.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")
df = pd.DataFrame(columns=["Product Title", "Description", "Price"])

# Widen column display so long descriptions aren't truncated when printed.
pd.set_option("max_colwidth", 1000)

for listing in listings:
    # Extract price (not every listing has one).
    p = listing.find("span", class_="priceinfo")
    price = p.text if p else ""

    # Extract the title and the link to the listing's details page.
    anchor = listing.find("a", class_="cl-app-anchor text-only posting-title")
    title = anchor.text
    url = anchor.get("href")

    # Fetch the details page and extract the description text.
    detailResp = requests.get(url).text
    detailSoup = BeautifulSoup(detailResp, "html.parser")
    description_element = detailSoup.find("section", id="postingbody")
    description = "".join(description_element.find_all(string=True, recursive=False))

    df = pd.concat(
        [pd.DataFrame([[title, description.strip(), price]], columns=df.columns), df],
        ignore_index=True,
    )

# Copy the data to CSV and JSON files.
df.to_csv("craigslist_results.csv", index=False)
df.to_json("craigslist_results.json", orient="split", index=False)
Here is a snippet of the output CSV file:
That’s it! You just scraped Craigslist listings without getting an IP block.
Now, let’s see how you can scrape Craigslist using Python's requests library along with Oxylabs' Residential Proxies. While this method gives you more control over the scraping process, it requires more manual setup.
First, install the necessary libraries:
pip install requests beautifulsoup4 pandas
Here’s a code sample showing how to connect through residential proxies, handle the response, and save the collected data:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

# Configure proxy settings
proxy = {
    "http": "http://your_username:your_password@pr.oxylabs.io:7777",
    "https": "http://your_username:your_password@pr.oxylabs.io:7777"
}

# Set up headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Referer": "https://craigslist.org/"
}

# Target URL for scraping
url = "https://newyork.craigslist.org/search/bka#search=2~gallery~0"

try:
    # Send the request through the proxy
    response = requests.get(url, headers=headers, proxies=proxy)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all listing elements
        listings = soup.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")

        # Prepare a list to store the data
        data = []

        for listing in listings:
            # Extract title
            title_element = listing.find("a", class_="cl-app-anchor text-only posting-title")
            title = title_element.text.strip() if title_element else "N/A"

            # Extract price
            price_element = listing.find("span", class_="priceinfo")
            price = price_element.text.strip() if price_element else "N/A"

            # Extract link to the detailed listing page
            link = title_element["href"] if title_element and "href" in title_element.attrs else "N/A"

            # Add some randomized delay between requests to mimic human behavior
            time.sleep(random.uniform(1, 3))

            # Get detailed page content if link is available
            description = "N/A"
            if link != "N/A":
                try:
                    detail_response = requests.get(link, headers=headers, proxies=proxy)
                    if detail_response.status_code == 200:
                        detail_soup = BeautifulSoup(detail_response.text, "html.parser")
                        description_element = detail_soup.find("section", id="postingbody")
                        if description_element:
                            # Remove any "prohibited" or system messages often included in the posting body
                            for element in description_element.find_all(class_="removed"):
                                element.decompose()
                            description = description_element.get_text(strip=True)
                except Exception as e:
                    print(f"Error fetching details for {link}: {e}")

            # Add the data to our list
            data.append({
                "Title": title,
                "Price": price,
                "Link": link,
                "Description": description
            })
            print(f"Scraped: {title} - {price}")

        # Convert to DataFrame for easier data manipulation
        df = pd.DataFrame(data)

        # Save data to CSV and JSON for further analysis
        df.to_csv("craigslist_data.csv", index=False, encoding="utf-8")
        df.to_json("craigslist_data.json", orient="records", indent=4)

        print(f"Successfully scraped {len(data)} listings and saved to craigslist_data.csv and craigslist_data.json")
    else:
        print(f"Failed to access Craigslist. Status code: {response.status_code}")

except requests.exceptions.ProxyError:
    print("Proxy connection error. Please check your proxy credentials and connection.")
except requests.exceptions.ConnectionError:
    print("Connection error. Craigslist might be blocking your request.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
With this approach implemented correctly, you'll be able to successfully extract Craigslist listing data while minimizing the risk of IP blocks. Residential proxies provide a reliable solution for maintaining access to Craigslist, especially when combined with proper request patterns and thoughtful timing that mimics human browsing behavior.
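One way to implement those request patterns is to reuse a single requests.Session with automatic retries and exponential backoff. Here's a minimal sketch, assuming the same proxy endpoint as above; the retry settings are illustrative, not Oxylabs recommendations:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Reuse one session so connections and proxy settings persist across requests
session = requests.Session()
session.proxies = {
    "http": "http://your_username:your_password@pr.oxylabs.io:7777",
    "https": "http://your_username:your_password@pr.oxylabs.io:7777",
}

# Retry transient failures (429/5xx) with exponential backoff
retry = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.get("https://newyork.craigslist.org/search/bka", timeout=30)
print(response.status_code)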
Selecting a Scraping Approach
Wondering whether to use proxies for scraping, go without them, or employ a dedicated scraping tool? Here's a comparative table to guide your decision.
| Approach | Key Features | Advantages | Limitations | Best For |
|---|---|---|---|---|
| No proxies | Basic HTTP requests; single IP address; simple request handling | Easy to implement; no additional costs; minimal setup; fine for small-scale scraping | High risk of blocks; limited request volume; no geo-targeting; poor scalability | Small projects; non-restricted content; testing and development |
| With proxies | Rotating IP addresses; geo-location targeting; session management | High success rates; scalability; anti-ban protection; geographic flexibility | Proxy management overhead; higher costs; more complex setup; needs monitoring | Large-scale operations; competitor monitoring; global data collection |
| Scraper APIs | Pre-built infrastructure; JavaScript rendering; parsing; CAPTCHA handling | Ready-to-use solution; maintenance-free; technical support | Higher costs; limited customization; API-specific limitations; dependency on provider | Complex websites; JavaScript-heavy sites; resource-constrained teams |
Scraping Craigslist data is useful for market analysis, lead generation, competitor analysis, and dataset acquisition. However, Craigslist's strong security measures, such as IP blocks and CAPTCHA screens, make scraping difficult.
This is where Oxylabs' Craigslist API comes in. It helps scrape Craigslist's public listings at scale without running into IP blocks or CAPTCHA issues.
Lastly, a proxy server is essential for block-free web scraping. To make your traffic resemble organic visits, consider proxy solutions such as residential proxies and datacenter IPs.
Consider using Oxylabs' IP rotation and proxy services to avoid triggering CAPTCHA on Craigslist. It will be more difficult for Craigslist to identify automated scraping if IP addresses are constantly changing.
Follow these tips to prevent detection while scraping Craigslist (a short sketch combining them follows the list):
Make use of a pool of rotating IP addresses.
Add randomized delays between requests to mimic human behavior.
Rotate user-agent strings to simulate different web browsers.
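Here's a minimal sketch combining the last two tips; the user-agent strings below are illustrative examples, not a vetted list:
import random
import time
import requests

# Illustrative user-agent pool -- replace with strings matching real browsers
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def polite_get(url):
    # Randomized delay to mimic human pacing between requests
    time.sleep(random.uniform(2, 6))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=30)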
Scraping publicly available information is legal. However, rules and regulations may vary on a case-by-case basis, so we highly recommend seeking professional legal counsel before starting any Craigslist scraping project. To learn more, read our Is Web Scraping Legal? article.
Yes, you can, since all public Craigslist data can be scraped. However, a tool such as a Craigslist ad API is highly recommended to streamline your ad data gathering jobs.
Craigslist does not offer an official public API for developers to access their data programmatically. Third-party solutions like Oxylabs' Craigslist Scraper API provide an alternative by handling the complexities of scraping while offering API-like functionality, allowing you to retrieve structured data from Craigslist listings without having to build and maintain your own scraping infrastructure.
About the author
Danielius Radavicius
Former Copywriter
Danielius Radavičius was a Copywriter at Oxylabs. Having grown up in films, music, and books and having a keen interest in the defense industry, he decided to move his career toward tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.