
How to Scrape Craigslist Data With Python


Danielius Radavicius

2023-10-12 · 4 min read

Craigslist, one of the biggest online classified ad platforms, is a rich source of business information on a wide range of goods and services. Before buying or selling anything there, you may want to scrape relevant listings data from Craigslist for a detailed market analysis.

After all, manually gathering information from Craigslist can take a while. Thankfully, web scraping can help. In this step-by-step tutorial, we'll use Python and the Oxylabs Craigslist API to show you how Craigslist data can be scraped without being blocked. We'll also cover how to extract listing information such as prices, titles, and descriptions.

Why scrape Craigslist?

Craigslist has a wide variety of listings in many different categories, including real estate, employment, and services. Scraping Craigslist may provide useful information for a variety of purposes:

  • Market Research: Analyze price patterns and market demand for specific goods or services.

  • Lead Generation: Gather contact details, such as phone numbers and email addresses, from prospective customers or clients.

  • Competitive Analysis: Examine your competitors' listings and pricing strategies.

  • Data Gathering: Collect information for analysis, research, or building your own Craigslist-based applications.

However, it's important to remember that scraping Craigslist can be difficult due to the security measures in place to prevent automated access, such as IP blocking and CAPTCHA challenges. If you try scraping Craigslist using ordinary techniques, you may encounter the common error: “This IP has been automatically blocked. If you have questions, please email…”.

This is where the Oxylabs Craigslist Scraper comes in. It bypasses Craigslist CAPTCHAs and shields users from IP blocks, so let's get started!


Scrape Craigslist data with Oxylabs’ Craigslist API

Here is a step-by-step tutorial on scraping Craigslist listings using Oxylabs' Craigslist API:

Step 1: Setting up the environment

Before we start scraping, make sure you have the following libraries installed:

pip install requests beautifulsoup4 pandas

Step 2: Get Oxylabs API access

You need access to the Oxylabs API to scrape Craigslist page data without encountering IP blocks. Get API credentials by creating an Oxylabs account.

Head over to the Users tab in the Oxylabs dashboard and create an API user.
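Hardcoding credentials in a script makes them easy to leak. As a small sketch, you could load them from environment variables instead; the variable names `OXYLABS_USERNAME` and `OXYLABS_PASSWORD` are our own convention here, not an Oxylabs requirement:

```python
import os

def get_credentials():
    # Read the API user credentials from the environment rather than
    # embedding them in source code.
    username = os.environ.get("OXYLABS_USERNAME")
    password = os.environ.get("OXYLABS_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set OXYLABS_USERNAME and OXYLABS_PASSWORD first")
    return username, password
```

You can then pass the returned tuple straight to the `auth` parameter of the request shown below.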

Step 3: Code in action

After getting the credentials for your API, you can start writing your code in Python.

1. Create a new Python code file and import the following libraries:

import requests
from bs4 import BeautifulSoup
import pandas as pd

2. Add code to create a payload for the Web Scraper API like this:

payload = {
   'source': 'universal',
   'url': 'https://newyork.craigslist.org/search/bka#search=1~gallery~0~1',
   'render': 'html'
}

The Craigslist API is part of Oxylabs' Web Scraper API. To get Craigslist data, set the source parameter to universal. Setting the url to the target Craigslist page and passing the html value to the render parameter instructs the API to return the rendered HTML of that page, which we can then parse for the required data.
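If you plan to scrape several search pages, the payload above can be wrapped in a tiny helper so the same structure is reused for each URL. Only the source, url, and render keys shown in this tutorial are used; consult the Oxylabs documentation before adding any other parameters:

```python
def build_payload(url, render=True):
    # Build the Web Scraper API payload for a given Craigslist URL.
    payload = {"source": "universal", "url": url}
    if render:
        # 'html' asks the API to return the rendered page HTML.
        payload["render"] = "html"
    return payload
```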

3. Initialize the request and get the response in some variable.

response = requests.request(
   'POST',
   'https://realtime.oxylabs.io/v1/queries',
   auth=('<your_username>', '<your_password>'),
   json=payload,
)

4. After receiving the response, you can get the required HTML content by converting the response object into JSON format. The HTML is in the content key of this object.

result = response.json()['results']
htmlContent = result[0]['content']
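Indexing straight into the JSON will raise a bare KeyError if a job fails. As a defensive variant of the step above (a sketch, not part of the Oxylabs client), you can validate the response shape first:

```python
def extract_html(response_json):
    # Validate the Web Scraper API response shape before indexing into it,
    # so a failed job raises a clear error instead of a KeyError.
    results = response_json.get("results") or []
    if not results or "content" not in results[0]:
        raise ValueError("Unexpected API response: no results/content")
    return results[0]["content"]
```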

5. The HTML content can be further parsed using any HTML parsing tool to extract the desired information. For demonstration, let's use BeautifulSoup to extract the product title, price, and description. But first, you need to inspect the HTML source of these elements on the page.

The description of the product is on the product details page. Let's now parse the retrieved HTML and extract these elements using BeautifulSoup.

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(htmlContent, 'html.parser')

# Extract prices, titles, and descriptions from Craigslist listings
listings = soup.find_all('li', class_='cl-search-result cl-search-view-mode-gallery')

df = pd.DataFrame(columns=["Product Title", "Description", "Price"])

for listing in listings:
   # Extract price
   p = listing.find('span', class_='priceinfo')
   if p:
       price = p.text
   else:
       price = ""


   # Extract title and listing URL
   anchor = listing.find('a', class_='cl-app-anchor text-only posting-title')
   title = anchor.text
   url = anchor.get('href')


   detailResp = requests.get(url).text

   detailSoup = BeautifulSoup(detailResp, 'html.parser')

   description_element = detailSoup.find('section', id='postingbody')
   description = ''.join(description_element.find_all(string=True, recursive=False))
   df = pd.concat(
       [pd.DataFrame([[title, description.strip(), price]], columns=df.columns), df],
       ignore_index=True,
   )

The above code extracts the product list and then loops through each product to get its title, price, and description. All this data is then saved in a dataframe.
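The per-listing logic can also be isolated into a small function, which makes it easy to test against a static HTML fragment. The sketch below uses the same class names as the tutorial's selectors; Craigslist's markup changes over time, so verify them against the live page:

```python
from bs4 import BeautifulSoup

def parse_listing(listing):
    # Extract title, URL, and price from one search-result <li> element.
    anchor = listing.find("a", class_="cl-app-anchor text-only posting-title")
    price = listing.find("span", class_="priceinfo")
    return {
        "title": anchor.text.strip() if anchor else "",
        "url": anchor.get("href") if anchor else "",
        "price": price.text.strip() if price else "",
    }

# A minimal, made-up fragment mimicking one search result:
sample = """
<li class="cl-search-result cl-search-view-mode-gallery">
  <a class="cl-app-anchor text-only posting-title" href="https://example.org/post/1">Vintage bike</a>
  <span class="priceinfo">$120</span>
</li>
"""
listing = BeautifulSoup(sample, "html.parser").find("li")
print(parse_listing(listing))
```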

6. Save this dataframe to CSV and JSON files using the following lines of code:

df.to_csv("craigslist_results.csv", index=False)
df.to_json("craigslist_results.json", orient="split", index=False)
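A quick sanity check after exporting is to read the CSV back and confirm the columns survived the round trip. This sketch builds a one-row dataframe with made-up values just for the check:

```python
import pandas as pd

df = pd.DataFrame(
    [["Vintage bike", "Barely used", "$120"]],
    columns=["Product Title", "Description", "Price"],
)
df.to_csv("craigslist_results.csv", index=False)

# Read the file back and verify the schema round-tripped intact.
check = pd.read_csv("craigslist_results.csv")
print(list(check.columns))  # ['Product Title', 'Description', 'Price']
```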

Here is what the complete code looks like:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Structure payload.
payload = {
   "source": "universal",
   "url": "https://newyork.craigslist.org/search/bka#search=1~gallery~0~1",
   "render": "html",
}

# Get response.
response = requests.request(
   "POST",
   "https://realtime.oxylabs.io/v1/queries",
   auth=("<your_username>", "<your_password>"),
   json=payload,
)

# JSON response with the result.
result = response.json()["results"]
htmlContent = result[0]["content"]

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(htmlContent, "html.parser")

# Extract prices, titles, and descriptions from Craigslist listings
listings = soup.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")

df = pd.DataFrame(columns=["Product Title", "Description", "Price"])
for listing in listings:
   # Extract price
   p = listing.find("span", class_="priceinfo")
   if p:
       price = p.text
   else:
       price = ""

   # Extract title and listing URL
   anchor = listing.find("a", class_="cl-app-anchor text-only posting-title")
   title = anchor.text
   url = anchor.get("href")

   detailResp = requests.get(url).text

   detailSoup = BeautifulSoup(detailResp, "html.parser")

   description_element = detailSoup.find("section", id="postingbody")
   description = "".join(description_element.find_all(string=True, recursive=False))

   df = pd.concat(
       [pd.DataFrame([[title, description.strip(), price]], columns=df.columns), df],
       ignore_index=True,
   )
   
# Copy the data to CSV and JSON files
df.to_csv("craigslist_results.csv", index=False)
df.to_json("craigslist_results.json", orient="split", index=False)

Here is a snippet of the output CSV file:

That’s it! You just scraped Craigslist listings without getting an IP block.
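For downstream analysis, the scraped price strings (e.g., "$1,200") usually need to be converted to numbers. A small post-processing sketch:

```python
import re

def parse_price(price_text):
    # Pull the numeric part out of a Craigslist price string like "$1,200"
    # and return it as a float; return None when no number is present.
    match = re.search(r"[\d,]+(?:\.\d+)?", price_text or "")
    if not match:
        return None
    return float(match.group().replace(",", ""))

print(parse_price("$1,200"))  # 1200.0
print(parse_price(""))        # None
```

You could apply this to the dataframe with `df["Price"].apply(parse_price)` before analysis.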

Scraping Craigslist with Oxylabs’ Residential Proxies

Now, let’s see how you can scrape Craigslist using Python's requests library along with Oxylabs' Residential Proxies. While this method gives you more control over the scraping process, it requires more manual setup. 

Step 1: Install the required libraries

First, install the necessary libraries:

pip install requests beautifulsoup4 pandas

Step 2: Set up and execute the scraper

Here’s a code sample showing how to connect through residential proxies, handle the response, and save the collected data:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

# Configure proxy settings
proxy = {
    "http": "http://your_username:your_password@pr.oxylabs.io:7777",
    "https": "http://your_username:your_password@pr.oxylabs.io:7777"
}

# Set up headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Referer": "https://craigslist.org/"
}

# Target URL for scraping
url = "https://newyork.craigslist.org/search/bka#search=2~gallery~0"

try:
    # Send the request through the proxy
    response = requests.get(url, headers=headers, proxies=proxy)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")
        
        # Find all listing elements
        listings = soup.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")
        
        # Prepare a list to store the data
        data = []
        
        for listing in listings:
            # Extract title
            title_element = listing.find("a", class_="cl-app-anchor text-only posting-title")
            title = title_element.text.strip() if title_element else "N/A"
            
            # Extract price
            price_element = listing.find("span", class_="priceinfo")
            price = price_element.text.strip() if price_element else "N/A"
            
            # Extract link to the detailed listing page
            link = title_element["href"] if title_element and "href" in title_element.attrs else "N/A"
            
            # Add some randomized delay between requests to mimic human behavior
            time.sleep(random.uniform(1, 3))
            
            # Get detailed page content if link is available
            description = "N/A"
            if link != "N/A":
                try:
                    detail_response = requests.get(link, headers=headers, proxies=proxy)
                    if detail_response.status_code == 200:
                        detail_soup = BeautifulSoup(detail_response.text, "html.parser")
                        description_element = detail_soup.find("section", id="postingbody")
                        if description_element:
                            # Remove any "prohibited" or system messages often included in the posting body
                            for element in description_element.find_all(class_="removed"):
                                element.decompose()
                            description = description_element.get_text(strip=True)
                except Exception as e:
                    print(f"Error fetching details for {link}: {e}")
            
            # Add the data to our list
            data.append({
                "Title": title,
                "Price": price,
                "Link": link,
                "Description": description
            })
            
            print(f"Scraped: {title} - {price}")
        
        # Convert to DataFrame for easier data manipulation
        df = pd.DataFrame(data)
        
        # Save data to CSV and JSON for further analysis
        df.to_csv("craigslist_data.csv", index=False, encoding="utf-8")
        df.to_json("craigslist_data.json", orient="records", indent=4)
        
        print(f"Successfully scraped {len(data)} listings and saved to craigslist_data.csv and craigslist_data.json")
    
    else:
        print(f"Failed to access Craigslist. Status code: {response.status_code}")

except requests.exceptions.ProxyError:
    print("Proxy connection error. Please check your proxy credentials and connection.")
except requests.exceptions.ConnectionError:
    print("Connection error. Craigslist might be blocking your request.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

With this approach implemented correctly, you'll be able to successfully extract Craigslist listing data while minimizing the risk of IP blocks. Residential proxies provide a reliable solution for maintaining access to Craigslist, especially when combined with proper request patterns and thoughtful timing that mimics human browsing behavior.
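Transient proxy or connection errors are common in this setup, so the requests above are often worth wrapping in a retry loop with exponential backoff. This is a sketch with illustrative defaults (three tries, one-second base delay), not an Oxylabs recommendation:

```python
import random
import time

import requests

def fetch_with_retries(url, headers=None, proxies=None, max_tries=3, base_delay=1.0):
    # Retry a GET request with exponential backoff plus random jitter;
    # return the response on HTTP 200, or None if every attempt fails.
    for attempt in range(max_tries):
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            pass
        # Backoff schedule: 1s, 2s, 4s... plus up to 0.5s of jitter.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return None
```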

Selecting a scraping approach

Wondering whether to use proxies for scraping, go without them, or employ a dedicated scraping tool? Here's a comparative table to guide your decision.

Scraping methods compared

No proxies
  • Key features: basic HTTP requests, single IP address, simple request handling
  • Advantages: easy to implement, no additional costs, minimal setup, suits small-scale scraping
  • Limitations: high risk of blocks, limited request volume, no geo-targeting, poor scalability
  • Best for: small projects, non-restricted content, testing and development

With proxies
  • Key features: rotating IP addresses, geo-location targeting, session management
  • Advantages: high success rates, scalability, anti-ban protection, geographic flexibility
  • Limitations: proxy management, more costs, complex setup, needs monitoring
  • Best for: large-scale operations, competitor monitoring, global data collection

Scraper APIs
  • Key features: pre-built infrastructure, JavaScript rendering, parsing, CAPTCHA handling
  • Advantages: ready-to-use solution, maintenance-free, technical support
  • Limitations: higher costs, limited customization, API-specific limitations, dependency on provider
  • Best for: complex websites, JavaScript-heavy sites, resource-constrained teams

Conclusion

Scraping Craigslist data is important for market analysis, lead generation, competitor analysis, and dataset acquisition. However, Craigslist's strong security measures (e.g., IP blocks and CAPTCHA screens) make scraping difficult.

This is where Oxylabs’ Craigslist API comes in. It helps scrape Craigslist’s public listings at scale without getting any IP blocks or CAPTCHA issues.

Lastly, a proxy server is essential for block-free web scraping. To make your traffic resemble organic visits, you can buy proxy solutions, most notably residential proxies and datacenter IPs.

Frequently asked questions

How to avoid triggering a CAPTCHA on Craigslist?

Consider using Oxylabs' IP rotation and proxy services to avoid triggering CAPTCHA on Craigslist. It will be more difficult for Craigslist to identify automated scraping if IP addresses are constantly changing.

How do you not let Craigslist know you're scraping?

Follow these tips to prevent detection while scraping Craigslist:

  • Use a pool of rotating IP addresses.

  • Add random delays between queries to mimic human behavior.

  • Rotate user agent strings to simulate different web browsers.
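The last two tips can be sketched in a few lines; the user agent strings below are illustrative examples only:

```python
import random
import time

# A small pool of example browser user agent strings (illustrative only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    # Pick a different user agent for each request.
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_pause(low=1.0, high=3.0):
    # Sleep a random interval between queries to mimic human browsing.
    time.sleep(random.uniform(low, high))
```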

Does Craigslist allow scraping?

Scraping publicly available information is legal. However, rules and regulations may vary on a case-by-case basis, so we highly recommend seeking professional legal counsel before starting any Craigslist scraping projects. To learn more, read our Is Web Scraping Legal? article.

Can I extract data from Craigslist ads?

Yes, you can, since all public data on Craigslist can be scraped. However, a tool such as a Craigslist ad API is highly recommended to streamline your scraping jobs for gathering ads data.

Is there a Craigslist API?

Craigslist does not offer an official public API for developers to access their data programmatically. Third-party solutions like Oxylabs' Craigslist Scraper API provide an alternative by handling the complexities of scraping while offering API-like functionality, allowing you to retrieve structured data from Craigslist listings without having to build and maintain your own scraping infrastructure.

About the author


Danielius Radavicius

Former Copywriter

Danielius Radavičius was a Copywriter at Oxylabs. Having grown up on films, music, and books, and with a keen interest in the defense industry, he decided to move his career toward tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
