
How to Scrape Craigslist Data With Python

Danielius Radavicius

2023-10-12 · 3 min read

Craigslist, one of the biggest online classified ad platforms, is a crucial source of vital business information regarding various goods and services. However, before purchasing or selling any goods and services, you may want to scrape relevant listings data from Craigslist for a detailed market analysis.

After all, manually gathering information from Craigslist can take a while. Thankfully, web scraping can help in this situation. We'll take you through the step-by-step tutorial using Python and Oxylabs Craigslist API to show you how Craigslist data can be scraped without being blocked. Features like scraping listing information, including prices, titles, and descriptions, will also be discussed.

Why scrape Craigslist?

Craigslist has a wide variety of listings in many different categories, including real estate, employment, and services. Scraping Craigslist may provide useful information for a variety of purposes:

  • Market Research: Analyze price patterns and market demand for certain goods or services.

  • Lead Generation: Gather contact details, such as phone numbers and email addresses, from prospective customers or clients.

  • Competitive Analysis: Analyze your competitors' listings and pricing strategies.

  • Data Gathering: Gather information for analysis, research, or building your own Craigslist-based applications.

However, it's important to remember that scraping Craigslist might be difficult due to the security measures in place to prevent automated access, such as IP blocking and CAPTCHA challenges. If you try scraping Craigslist using ordinary techniques, you may encounter the common error: “This IP has been automatically blocked. If you have questions, please email…”

This is where Oxylabs Craigslist Scraper comes in to help. It bypasses Craigslist CAPTCHAs and shields users from IP blocks, so let’s get started!

Scrape Craigslist data with Oxylabs’ Craigslist API

    Here is the step-by-step tutorial on scraping Craigslist listings using Oxylabs’ Craigslist API:

    Step 1: Setting up the environment

    Before we start scraping, make sure you have the following libraries installed:

    pip install requests beautifulsoup4 pandas

    Step 2: Get Oxylabs API access

    You need access to the Oxylabs API to scrape Craigslist data without encountering IP blocks. Get API credentials by creating an Oxylabs account.

    Head over to the Users tab in the Oxylabs dashboard and create an API user.
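
    Optionally, you can confirm the credentials work before writing the full scraper. The minimal sketch below is an assumption-based smoke test, not part of the official setup flow: it reuses the realtime endpoint and universal source shown later in this tutorial, and https://www.craigslist.org/ is just an arbitrary test URL.

    import requests

    # Optional smoke test: replace the placeholders with the API user you just created.
    response = requests.post(
       'https://realtime.oxylabs.io/v1/queries',
       auth=('<your_username>', '<your_password>'),
       json={'source': 'universal', 'url': 'https://www.craigslist.org/'},
    )

    # 200 means the credentials were accepted; 401 typically means a wrong username or password.
    print(response.status_code)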

    Step 3: Code in action

    After getting the credentials for your API, you can start writing your code in Python.

    1. Create a new Python code file and import the following libraries:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    2. Add code to create a payload for the Web Scraper API like this:

    payload = {
       'source': 'universal',
       'url': 'https://newyork.craigslist.org/search/bka#search=1~gallery~0~1',
       'render': 'html'
    }

    Craigslist API is a part of the Web Scraper API by Oxylabs. To get Craigslist data, you need to set the source to universal. Setting the url to the target Craigslist page and passing the html value to the render parameter instructs the API to return the HTML content of the provided URL. We can then scrape the required data from that HTML content.

    3. Send the request and store the response in a variable:

    response = requests.request(
       'POST',
       'https://realtime.oxylabs.io/v1/queries',
       auth=('<your_username>', '<your_password>'),
       json=payload,
    )

    4. After receiving the response, you can get the required HTML content by converting the response object into JSON format. The HTML is in the content key of this object.

    result = response.json()['results']
    htmlContent = result[0]['content']

    5. The HTML content can be further parsed using any HTML parsing tool to extract the desired information. For demonstration, let’s use BeautifulSoup to extract the product title, price, and description. But first, you need to have a look at the HTML source of these elements on the page.

    The description of the product is on the product details page. Let’s now parse the retrieved HTML and extract these elements using BeautifulSoup.

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(htmlContent, 'html.parser')
    
    # Extract prices, titles, and descriptions from Craigslist listings
    listings = soup.find_all('li', class_='cl-search-result cl-search-view-mode-gallery')
    
    df = pd.DataFrame(columns=["Product Title", "Description", "Price"])
    
    for listing in listings:
       # Extract price
       p = listing.find('span', class_='priceinfo')
       if p:
           price = p.text
       else:
           price = ""
    
    
       # Extract title
       title = listing.find('a', class_='cl-app-anchor text-only posting-title').text
       url = listing.find('a', class_='cl-app-anchor text-only posting-title').get('href')
    
    
       detailResp = requests.get(url).text
    
       detailSoup = BeautifulSoup(detailResp, 'html.parser')
    
       description_element = detailSoup.find('section', id='postingbody')
       description = ''.join(description_element.find_all(text=True, recursive=False))
       df = pd.concat(
           [pd.DataFrame([[title, description.strip(), price]], columns=df.columns), df],
           ignore_index=True,
       )

    The above code extracts the product list and then loops through each product to get its title, price, and description. All this data is then saved in a dataframe.
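
    Note that the loop above fetches each product details page directly with requests.get(), which may still run into IP blocks at scale. As a hedged alternative (a sketch that reuses the payload structure and endpoint shown earlier, not part of the original tutorial), the details page could also be retrieved through the Web Scraper API:

    # Sketch: fetch the details page through the Web Scraper API instead of requests.get(url)
    detail_payload = {
       'source': 'universal',
       'url': url,  # the listing URL extracted above
       'render': 'html'
    }
    detail_response = requests.request(
       'POST',
       'https://realtime.oxylabs.io/v1/queries',
       auth=('<your_username>', '<your_password>'),
       json=detail_payload,
    )
    detailResp = detail_response.json()['results'][0]['content']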

    6. Save this dataframe to CSV and JSON files using the following lines of code:

    df.to_csv("craigslist_results.csv", index=False)
    df.to_json("craigslist_results.json", orient="split", index=False)

    Here is what the complete code looks like:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    # Structure payload.
    payload = {
       "source": "universal",
       "url": "https://newyork.craigslist.org/search/bka#search=1~gallery~0~1",
       "render": "html",
    }
    
    # Get response.
    response = requests.request(
       "POST",
       "https://realtime.oxylabs.io/v1/queries",
       auth=("<your_username>", "<your_password>"),
       json=payload,
    )
    
    # JSON response with the result.
    result = response.json()["results"]
    htmlContent = result[0]["content"]
    
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(htmlContent, "html.parser")
    
    # Extract prices, titles, and descriptions from Craigslist listings
    listings = soup.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")
    
    df = pd.DataFrame(columns=["Product Title", "Description", "Price"])
    for listing in listings:
       # Extract price
       p = listing.find("span", class_="priceinfo")
       if p:
           price = p.text
       else:
           price = ""
    
       # Extract title
       title = listing.find("a", class_="cl-app-anchor text-only posting-title").text
       url = listing.find("a", class_="cl-app-anchor text-only posting-title").get("href")
    
       detailResp = requests.get(url).text
    
       detailSoup = BeautifulSoup(detailResp, "html.parser")
    
       description_element = detailSoup.find("section", id="postingbody")
       description = "".join(description_element.find_all(text=True, recursive=False))
    
       # Prevent pandas from truncating long description values when displaying the DataFrame
       pd.set_option("max_colwidth", 1000)
       df = pd.concat(
           [pd.DataFrame([[title, description.strip(), price]], columns=df.columns), df],
           ignore_index=True,
       )
       
    # Save the data to CSV and JSON files
    df.to_csv("craigslist_results.csv", index=False)
    df.to_json("craigslist_results.json", orient="split", index=False)

    The output CSV file contains the product title, description, and price for each scraped listing.

    That’s it! You just scraped Craigslist listings without getting an IP block.

    Conclusion

    Scraping Craigslist data is important for market analysis, lead generation, competitor analysis, and dataset acquisition purposes. However, Craigslist’s strong security measures (e.g., IP blocks and CAPTCHA screens) make the scraping tasks extremely difficult.

    This is where Oxylabs’ Craigslist API comes in. It helps scrape Craigslist’s public listings at scale without IP blocks or CAPTCHA issues.

    Lastly, proxies are essential for block-free web scraping. To make your traffic resemble organic traffic, you can buy proxy solutions, most notably residential and datacenter IPs.
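
    As a quick illustration (a minimal sketch assuming a hypothetical proxy endpoint and credentials, not a specific Oxylabs product configuration), routing requests through a proxy with the requests library looks like this:

    import requests

    # Hypothetical proxy endpoint and credentials, for illustration only.
    proxies = {
       'http': 'http://<proxy_user>:<proxy_password>@<proxy_host>:<proxy_port>',
       'https': 'http://<proxy_user>:<proxy_password>@<proxy_host>:<proxy_port>',
    }

    # Requests routed this way appear to originate from the proxy's IP address.
    response = requests.get('https://newyork.craigslist.org/', proxies=proxies)
    print(response.status_code)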

    Frequently asked questions

    How to avoid triggering a CAPTCHA on Craigslist?

    Consider using Oxylabs' IP rotation and proxy services to avoid triggering CAPTCHA on Craigslist. It will be more difficult for Craigslist to identify automated scraping if IP addresses are constantly changing.

    How do you not let Craigslist know you're scraping?

    Follow these tips to prevent detection while scraping Craigslist; a short sketch illustrating them follows the list:

    • Make use of various IP addresses.

    • Add arbitrary delays between queries to mimic human behavior.

    • Rotate user agent strings to simulate different web browsers.
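
    Here is a minimal sketch of the last two tips, with illustrative user agent strings and a single target URL (IP rotation via proxies was sketched in the conclusion above):

    import random
    import time

    import requests

    # Illustrative user agent strings; rotate through a larger, up-to-date pool in practice.
    user_agents = [
       'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    urls = ['https://newyork.craigslist.org/search/bka']

    for url in urls:
       # Pick a different user agent for each request.
       headers = {'User-Agent': random.choice(user_agents)}
       response = requests.get(url, headers=headers)
       print(url, response.status_code)

       # Arbitrary delay between queries to mimic human browsing behavior.
       time.sleep(random.uniform(2, 6))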

    Does Craigslist allow scraping?

    Scraping publicly available information is legal. However, rules and regulations may change on a case-by-case basis, so we highly recommend getting professional legal counsel before starting any Craigslist scraping project. To learn more, read our Is Web Scraping Legal? article.

    Can I extract data from Craigslist ads?

    Yes, public data from Craigslist ads can be scraped. However, a tool such as the Craigslist ad API is highly recommended to streamline your scraping jobs when gathering ads data.

    About the author

    Danielius Radavicius

    Former Copywriter

    Danielius Radavičius was a Copywriter at Oxylabs. Having grown up in films, music, and books and having a keen interest in the defense industry, he decided to move his career toward tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.

    All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
