
How to Scrape Indeed Jobs Data

Danielius Radavicius

2023-12-15 · 5 min read

In an era where data drives decisions, accessing up-to-date job market information is crucial. Indeed.com, a leading job portal, offers extensive insights into job openings, popular roles, and company hiring trends. However, manually collecting this job data can be tedious and time-consuming. This is where web scraping, supported by the use of proxies, comes in as a game-changer, and Oxylabs' Web Scraper API makes this task seamless, efficient, and reliable.

Why Scrape Indeed?

Scraping Indeed.com allows businesses, analysts, and job seekers to stay ahead in the competitive job market. From tracking the most popular jobs to understanding industry demands, the insights gained from job postings and job details on Indeed are invaluable. Automated data collection through scraping not only saves time but also provides a more comprehensive view of the job landscape. Job scraping is a technique widely used by HR professionals.

The Tool: Oxylabs’ Web Scraper API

Oxylabs' Web Scraper API is designed to handle complex web scraping tasks with ease. It bypasses anti-bot measures, ensuring you get the job data you need without interruption. Whether you're looking to scrape job titles, company names, or detailed job descriptions, Oxylabs simplifies the process.

This step-by-step tutorial will guide you through scraping job postings from Indeed.com, focusing on extracting key job details like job titles, descriptions, and company names.


    Project Setup

    You can find the following code on our GitHub.

    Prerequisites

    Before diving into the code to scrape Indeed, make sure Python 3.8 or newer is installed on your machine, since this guide is written for Python 3.8+.

    Creating a Virtual Environment

    A virtual environment is an isolated space where you can install libraries and dependencies without affecting your global Python setup. It's a good practice to create one for each project. Here's how to set it up on different operating systems:

    python -m venv indeed_env #Windows
    python3 -m venv indeed_env #Mac and Linux

    Replace indeed_env with the name you'd like to give to your virtual environment.

    Activating the Virtual Environment

    Once the virtual environment is created, you'll need to activate it:

    .\indeed_env\Scripts\Activate #Windows
    source indeed_env/bin/activate #Mac and Linux

    You should see the name of your virtual environment in the terminal, indicating that it's active.
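
    If you would rather confirm the setup from Python itself, the following is a minimal sketch that uses only the standard library. The helper names in_virtualenv and python_is_supported are made up for this illustration. The check assumes an environment created with venv, where sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def python_is_supported(minimum=(3, 8)) -> bool:
    """Return True when the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("Python 3.8 or newer:", python_is_supported())
```

    Running this with the environment activated should report that a virtual environment is active. If it does not, the activation step above most likely did not take effect in the current terminal.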

    Installing Required Libraries

    We'll use the requests library to make HTTP requests and pandas to export the results to CSV. Install both by running the following command:

    pip install requests pandas

    And there you have it! Your project environment is ready for scraping Indeed data with Oxylabs' Web Scraper API. The following sections cover how the API works and how Indeed pages are structured.

    Overview of Web Scraper API

    Oxylabs' Web Scraper API allows you to extract data from many complex websites easily.

    The following is a simple example that shows how Scraper API works.

    # scraper_api_demo.py
    import requests
    
    payload = {
        "source": "universal",
        "url": "https://www.indeed.com"
    }
    
    response = requests.post(
        url="https://realtime.oxylabs.io/v1/queries",
        json=payload,
        auth=("username", "password"),
    )
    
    print(response.json())

    As you can see, the payload is where you would inform the API what and how you want to scrape.

    Save this code in a file scraper_api_demo.py and run it. You will see that the entire HTML of the page will be printed, along with some additional information from Scraper API.
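
    Instead of printing the whole response, you may want only the page HTML. The sketch below assumes the response shape that the later examples in this guide rely on: a results list whose first item has a content key. The extract_content helper and the sample dictionary are hypothetical and stand in for a real response.json().

```python
def extract_content(response_json: dict):
    """Return the content of the first result, or raise if there is none."""
    results = response_json.get("results") or []
    if not results:
        raise ValueError("response contains no results")
    return results[0].get("content")


# Hypothetical stand-in for response.json(); a real response carries
# additional fields alongside "content".
sample_response = {
    "results": [
        {"content": "<html><head><title>Indeed</title></head></html>"}
    ]
}

html = extract_content(sample_response)
print(len(html), "characters of HTML")
```

    In the demo script you would pass response.json() to extract_content instead of the sample dictionary.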

    In the following section, let's examine various parameters we can send in the payload.

    Scraper API Parameters

    The most critical parameter is source. For Indeed, set the source as universal, a general-purpose source that can handle all domains.

    The parameter url is self-explanatory, a direct link to the page you want to scrape.

    The example code in the earlier section has only these two parameters. The result is, however, the entire HTML of the page. 

    Instead, what we need is parsed data. This is where the parameter parse comes into the picture. When you send parse as True, you must also send one more parameter: parsing_instructions. Combined, these two parameters allow you to get parsed data in any structure you like.

    The following allows you to parse the page title and retrieve results in JSON:

    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//title/text()"]
            }
        ]
    }

    The key _fns indicates a list of functions, which can contain one or more functions indicated by the "_fn" key, along with the arguments.

    In this example, the function is xpath_one, which takes an XPath and returns one matching element. On the other hand, the function xpath returns all matching elements.

    On similar lines are css_one and css functions that use CSS selectors instead of XPath.

    For a complete list of available functions, see the Scraper API documentation.
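
    Since every instruction has the same _fns, _fn, and _args shape, a small helper can cut down on nested braces. This is only a sketch: the field function and ALLOWED_FUNCTIONS set are hypothetical names, and the set lists just the four functions mentioned above, not everything the API supports. The "a" selector in the second example is likewise only an illustration.

```python
# Only the four functions named in this guide; the API offers more.
ALLOWED_FUNCTIONS = {"xpath_one", "xpath", "css_one", "css"}


def field(fn_name: str, selector: str) -> dict:
    """Build one parsing instruction that applies a single function."""
    if fn_name not in ALLOWED_FUNCTIONS:
        raise ValueError(f"function not covered by this sketch: {fn_name}")
    return {"_fns": [{"_fn": fn_name, "_args": [selector]}]}


# xpath_one returns a single match, so it suits the page title.
title_instruction = {"title": field("xpath_one", "//title/text()")}

# css returns every match, so it suits repeated elements such as links.
links_instruction = {"links": field("css", "a")}

print(title_instruction)
```

    The resulting dictionaries can be placed under parsing_instructions in the payload exactly like the hand-written version.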

    The following code prints the title of the Indeed page:

    # indeed_title.py
    import requests
    
    payload = {
        "source": "universal",
        "url": "https://www.indeed.com",
        "render": "html",
        "parse": True,
        "parsing_instructions": {
            "title": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["//title/text()"]
                    }
                ]
            }
        }
    }
    
    response = requests.post(
        url="https://realtime.oxylabs.io/v1/queries",
        json=payload,
        auth=("username", "password")
    )
    
    print(response.json()["results"][0]["content"])

    Run this file to get the title of Indeed. 

    In the next section, we will scrape jobs from a list.

    Scraping Indeed Job Postings

    Before scraping a page, we need to examine the page structure.

    Open the Indeed job search results in Chrome, right-click a job listing, and select Inspect.

    Move your mouse over the elements in the inspector until you can precisely select one job list item and its related data.

    You can use the following CSS selector to select one job listing:

    .job_seen_beacon

    We can iterate over each matching item and get the specific job data points such as job title, company name, location, salary range, date posted, and job description.

    First, create the placeholder for the job listings as follows:

    payload = {
        "source": "universal",
        "url": "https://www.indeed.com/jobs?q=work+from+home&l=San+Francisco%2C+CA",
        "render": "html",
        "parse": True,
        "parsing_instructions": {
            "job_listings": {
                "_fns": [
                    {
                        "_fn": "css",
                        "_args": [".job_seen_beacon"]
                    }
                ],

    Note the use of the function css, which returns all matching elements.
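
    The url value in the payload is an ordinary Indeed search URL with the query in q and the location in l. If you want to search for other jobs or places, a sketch like the following can build that URL with the standard library. The build_search_url helper is a hypothetical name, and the q and l parameters are taken from the example URL in this guide.

```python
from urllib.parse import urlencode


def build_search_url(query: str, location: str) -> str:
    """Build an Indeed search URL in the same form as the example payload."""
    # urlencode turns spaces into "+" and escapes characters such as commas.
    params = urlencode({"q": query, "l": location})
    return f"https://www.indeed.com/jobs?{params}"


url = build_search_url("work from home", "San Francisco, CA")
print(url)
```

    With these arguments the helper reproduces the URL used in the payload, so swapping in a different query or location only changes the two arguments.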

    Next, we can use the reserved property _items to indicate that we want to iterate over a list, further processing each list item separately.

    Inside _items, each selector is relative to the element already matched, so its path is concatenated to the path defined above, as follows:

     "job_listings": {
                "_fns": [
                    {
                        "_fn": "css",
                        "_args": [".job_seen_beacon"]
                    }
                ],
                "_items": {
                    "job_title": {
                        "_fns": [
                            {
                                "_fn": "xpath_one",
                                "_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"]
                            }
                        ]
                    },
                    "company_name": {
                        "_fns": [
                            {
                                "_fn": "xpath_one",
                                "_args": [".//span[@data-testid='company-name']/text()"]
                            }
                        ]
                    },

    Similarly, we can add other selectors. After adding the remaining details, the job_search_payload.json file contains the following:

    {
        "source": "universal",
        "url": "https://www.indeed.com/jobs?q=work+from+home&l=San+Francisco%2C+CA",
        "render": "html",
        "parse": true,
        "parsing_instructions": {
            "job_listings": {
                "_fns": [
                    {
                        "_fn": "css",
                        "_args": [".job_seen_beacon"]
                    }
                ],
                "_items": {
                    "job_title": {
                        "_fns": [
                            {
                                "_fn": "xpath_one",
                                "_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"]
                            }
                        ]
                    },
                    "company_name": {
                        "_fns": [
                            {
                                "_fn": "xpath_one",
                                "_args": [".//span[@data-testid='company-name']/text()"]
                            }
                        ]
                    },
                    "location": {
                        "_fns": [
                            {
                                "_fn": "xpath_one",
                                "_args": [".//div[@data-testid='text-location']//text()"]
                            }
                        ]
                    },
                    "salary_range": {
                        "_fns": [
                            {
                                "_fn": "xpath_one",
                                "_args": [".//div[contains(@class, 'salary-snippet-container') or contains(@class, 'estimated-salary')]//text()"]
                            }
                        ]
                    },
                    "date_posted": {
                        "_fns": [
                            {
                                "_fn": "xpath_one",
                                "_args": [".//span[@class='date']/text()"]
                            }
                        ]
                    },
                    "job_description": {
                        "_fns": [
                            {
                                "_fn": "xpath_one",
                                "_args": ["normalize-space(.//div[@class='job-snippet'])"]
                            }
                        ]
                    }
                }
            }
        }
    }
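
    If you prefer not to write the nested JSON by hand, one option is to build the payload in Python and serialize it. The following is a sketch under that assumption. The fn and item_list helpers are hypothetical names, and only two of the six fields are included to keep it short. Note that json.dumps writes Python's True as the JSON literal true, which is the form a .json file needs.

```python
import json


def fn(name: str, selector: str) -> dict:
    """Build one parsing instruction that applies a single function."""
    return {"_fns": [{"_fn": name, "_args": [selector]}]}


def item_list(container_selector: str, fields: dict) -> dict:
    """Select every container with css, then parse `fields` inside each one."""
    node = fn("css", container_selector)
    node["_items"] = fields
    return node


payload = {
    "source": "universal",
    "url": "https://www.indeed.com/jobs?q=work+from+home&l=San+Francisco%2C+CA",
    "render": "html",
    "parse": True,
    "parsing_instructions": {
        "job_listings": item_list(
            ".job_seen_beacon",
            {
                "job_title": fn(
                    "xpath_one",
                    ".//h2[contains(@class,'jobTitle')]/a/span/text()",
                ),
                "company_name": fn(
                    "xpath_one",
                    ".//span[@data-testid='company-name']/text()",
                ),
            },
        )
    },
}

# Serialize to JSON text; this is what would go into job_search_payload.json.
payload_text = json.dumps(payload, indent=4)
print(payload_text)
```

    Writing payload_text to job_search_payload.json gives a file that json.load can read back without changes.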

    A good way to organize your code is to save the payload as a separate JSON file. It will allow you to keep your Python file as short as follows:

    # parse_jobs.py
    import requests
    import json
    
    payload = {}
    with open("job_search_payload.json") as f:
        payload = json.load(f)
    
    response = requests.post(
        url="https://realtime.oxylabs.io/v1/queries",
        json=payload,
        auth=("username", "password"),
    )
    
    print(response.status_code)
    
    with open("result.json", "w") as f:
        json.dump(response.json(), f, indent=4)
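
    Before working with result.json, it can help to check that the parsed listings are actually there. The sketch below assumes the structure used elsewhere in this guide, with the parsed data under results, then content, then job_listings. The get_job_listings helper and the sample data are hypothetical.

```python
def get_job_listings(result: dict) -> list:
    """Return the parsed job listings, or an empty list if none are present."""
    results = result.get("results") or []
    if not results:
        return []
    content = results[0].get("content")
    # When parsing did not happen, content may be raw HTML rather than a dict.
    if not isinstance(content, dict):
        return []
    listings = content.get("job_listings") or []
    return [job for job in listings if isinstance(job, dict)]


# Hypothetical stand-in for the contents of result.json.
sample_result = {
    "results": [
        {
            "content": {
                "job_listings": [
                    {"job_title": "Data Analyst", "company_name": "Example Corp"},
                    {"job_title": "Support Agent", "company_name": "Sample LLC"},
                ]
            }
        }
    ]
}

jobs = get_job_listings(sample_result)
print(f"{len(jobs)} job listings found")
```

    An empty list here usually suggests that the selectors no longer match the page or that the request did not return parsed content.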

    Exporting to JSON and CSV

    The output of Scraper API is JSON, so you can save the extracted job listings as JSON directly.

    You can use a library such as Pandas to save the job data as CSV. 

    Remember that the parsed data is stored in the content inside results.

    As we created the job listings in the key job_listings, we can use the following snippet to save the extracted Indeed data:

    # parse_jobs.py
    import pandas as pd
    
    # Build a DataFrame from the parsed job listings and save it as CSV
    
    df = pd.DataFrame(response.json()["results"][0]["content"]["job_listings"])
    df.to_csv("job_search_results.csv", index=False)
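
    If you would rather avoid the pandas dependency, the standard library's csv module can produce a similar file. This is an alternative sketch, not the approach used above. The listings_to_csv helper and the sample listings are hypothetical, and the column names are assumed to match the keys in the parsing instructions.

```python
import csv
import io

# Column names assumed to match the keys used in the parsing instructions.
FIELDS = [
    "job_title",
    "company_name",
    "location",
    "salary_range",
    "date_posted",
    "job_description",
]


def listings_to_csv(listings: list, fields: list = FIELDS) -> str:
    """Serialize job listings to CSV text, leaving missing values empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for job in listings:
        row = {}
        for name in fields:
            value = job.get(name)
            # Listings without a salary or date come back as None.
            row[name] = "" if value is None else str(value).strip()
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical listings standing in for the parsed API output.
sample_listings = [
    {"job_title": "Data Analyst", "company_name": "Example Corp",
     "location": "San Francisco, CA", "salary_range": None},
    {"job_title": " Support Agent ", "company_name": "Sample LLC"},
]

csv_text = listings_to_csv(sample_listings)
print(csv_text)
```

    To write a file, open job_search_results.csv with newline="" and write csv_text to it.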

    Conclusion

    Using Web Scraper API simplifies scraping Indeed data, a job that can otherwise be rather difficult and daunting. Notably, you can even use GUI tools such as Postman or Insomnia to scrape Indeed. You only need to send a POST request to the API with the desired payload. For even more efficient scraping, you can buy proxies to enhance your performance and avoid potential blocking. Feel free to check out our general blog post on scraping job postings in 2024, where we delve deeper into the job scraping topic and showcase how to scrape Stackshare jobs.

    The detailed documentation on Web Scraper API is available here, and if you’d like to try our Web Scraper API, you can do so for free by registering on the dashboard.

    About the author

    Danielius Radavicius

    Former Copywriter

    Danielius Radavičius was a Copywriter at Oxylabs. Having grown up in films, music, and books and having a keen interest in the defense industry, he decided to move his career toward tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.

    All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
