How to Scrape Job Postings in 2024

Gabija Fatenaite

2024-04-04 · 7 min read
Job data is one of the most sought-after types of information in web crawling. That should come as no surprise given the ever-growing number of employment listings. There are plenty of ways for websites and companies to use job posting data:

  • Providing job search aggregation sites with relevant data.

  • Using the data to analyze job trends for better recruitment strategies.

  • Using the data in combination with AI for predictive analysis.

  • Comparing competitor information, etc.

So, where do you start when it comes to job scraping? No matter how you'll be using job search aggregation data, data gathering requires scraping solutions. In this blog post, we’ll go over where to start, and which solutions work best.

Web scraping job sites: the challenges

Web scraping job postings is notoriously difficult. Most of these sites use anti-scraping techniques, CAPTCHAs, and dynamic content, meaning your proxies can get blocked and blacklisted quite quickly. Websites keep getting better at preventing automated activity, while those collecting data keep getting better at hiding their footprints.

Keep in mind that there are ethical ways to reduce the risk of getting your proxies blocked without breaking any website's rules. When scraping job sites, make sure you do it the right way. We also have a dedicated blog post explaining how to crawl a website without getting blocked.
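One common courtesy, sketched below, is to check a site's robots.txt rules before requesting a page and to pause between requests. This is only an illustration: the robots.txt content, the `my-job-bot` user agent, and the example.com URLs are made up, and the rules are inlined so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the robots.txt rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    'https://example.com/jobs',
    'https://example.com/admin/users',
]
to_fetch = allowed_urls(parser, 'my-job-bot', candidates)

# Fall back to a 1-second pause when the site does not declare a crawl delay.
delay = parser.crawl_delay('my-job-bot') or 1

print(to_fetch)
print(delay)
```

In a real crawler you would sleep for `delay` seconds between requests and load the rules from the target site rather than from a string.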

However, the main challenge of scraping job postings comes when making a decision on how to get the data. There are a few options:

  • Building and setting up a job board scraper and/or in-house web scraping infrastructure.

  • Investing in job scraping tools.

  • Buying job aggregation site databases. 

Of course, there are pros and cons to each option.

Building an in-house scraper

Building and maintaining an in-house scraper can be pricey, especially if you don’t have a development and data analysis team. However, you won’t need to rely on any other third party to receive the data you need.

Pros

  • Greater adaptability, allowing for tailoring a solution that is optimized to meet the specific demands of your scraping project.

  • Full control over your scraping infrastructure.

Cons

  • Building and maintaining an in-house scraper requires tremendous resource commitment.

Buying a pre-built scraper

With a ready-made tool, you save on development and maintenance costs, but you'll be relying on a third party to perform well for you.

Pros

  • Reducing development and maintenance costs.

  • Easily scalable.

Cons

  • Less control over the scraping tool.

Buying job databases

One of the easier ways to get job postings data is simply buying pre-scraped job datasets from data companies that perform job scraping services.

Pros

  • Ease of use and acquisition.

  • No development resources needed.

Cons

  • No control over the data acquisition process.

  • There's a possibility of a dataset containing outdated information.

As there isn't a lot to explain with the last two options, we’ll go over the first one, building and setting up a job scraper, in greater detail.

Job posting scraping: building your own infrastructure

If you decide to build and set up your own job scraping tool, there are a handful of steps you should take into consideration: 

  • Analyze which languages, APIs, frameworks, and libraries are the most popular and are used widely. This will save you time when making development changes in the future. 

  • Create a stable and reliable testing environment, as building a job crawler comes with challenges of its own. You should have a simple version of it as well, as the decision-making will come from the business side of things, not production.

  • Data storage will become an issue, so invest in more storage capacity and think about space-saving methods.
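As one possible space-saving method for the storage point above, here is a minimal sketch that drops duplicate postings and writes the rest as gzip-compressed JSON Lines. The function names, the deduplication key, and the sample records are assumptions made for illustration, not part of any particular tool.

```python
import gzip
import json
import os
import tempfile


def save_jobs_compressed(jobs, path):
    """Write unique jobs as gzip-compressed JSON Lines and return the count written."""
    seen = set()
    written = 0
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        for job in jobs:
            # Assumed deduplication key: title, company, and application URL.
            key = (job.get('title'), job.get('company_name'), job.get('apply_url'))
            if key in seen:
                continue
            seen.add(key)
            f.write(json.dumps(job, ensure_ascii=False) + '\n')
            written += 1
    return written


def load_jobs_compressed(path):
    """Read the compressed JSON Lines file back into a list of dictionaries."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample records; the second one duplicates the first.
sample_jobs = [
    {'title': 'Data Engineer', 'company_name': 'Acme', 'apply_url': 'https://example.com/1'},
    {'title': 'Data Engineer', 'company_name': 'Acme', 'apply_url': 'https://example.com/1'},
    {'title': 'Backend Developer', 'company_name': 'Globex', 'apply_url': 'https://example.com/2'},
]

path = os.path.join(tempfile.mkdtemp(), 'jobs_sample.jsonl.gz')
written = save_jobs_compressed(sample_jobs, path)
restored = load_jobs_compressed(path)
print(written, len(restored))
```

JSON Lines keeps each posting on its own line, so new batches can be written to separate files and read back one record at a time.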

These are just the main guidelines to take into consideration. Creating your own web crawler from scratch is a big commitment, both financially and time-wise. Thus, you may want to utilize a ready-to-use scraping tool, like Oxylabs’ Web Scraper API, that’s built with scalability and anti-blocking in mind for any type of website. Let’s see how you can easily scrape job listings with Python and Web Scraper API.

1. Set up your environment

Begin by downloading and installing Python from the official website if you don’t have it set up already. Additionally, we recommend using an integrated development environment (IDE) like PyCharm, Visual Studio Code, or similar. Once you have everything ready, open up your terminal and install the requests library using pip, the Python package installer:

python -m pip install requests

Next, create a new Python file and import the following libraries:

import requests, json, csv

The requests library will help to send HTTP requests to the API, while the json and csv libraries will process the scraped and parsed data.

2. Get a free API trial

Web Scraper API comes with a 7-day free trial, so head to the dashboard and register with a free account.

After creating your API username and password, store these credentials in a variable:

    API_credentials = ('USERNAME', 'PASSWORD')

    3. Create the API payload

    Let’s use the Stackshare jobs page (https://stackshare.io/jobs) as the target website. In your Python file, create a payload dictionary, which will contain all the scraping and parsing instructions for the API:

    payload = {
        'source': 'universal',
        'url': 'https://stackshare.io/jobs',
        'geo_location': 'United States',
    }

    Notice the geo_location parameter – it tells the API to use a proxy server located in the United States, which you can also change to any other worldwide location or remove it altogether to use your own location. Head to our documentation to learn more about the parameters and integration methods you can use with Web Scraper API.
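If you expect to switch locations often, a small helper can make geo_location optional. The sketch below is an assumption about how you might organize your own code: `build_payload` is a hypothetical function, and 'Germany' is only an example value that you should check against the documentation.

```python
def build_payload(url, geo_location=None):
    """Build a basic payload and add geo_location only when one is given."""
    payload = {
        'source': 'universal',
        'url': url,
    }
    if geo_location:
        payload['geo_location'] = geo_location
    return payload


us_payload = build_payload('https://stackshare.io/jobs', 'United States')
# 'Germany' is an illustrative value, not a confirmed API option.
de_payload = build_payload('https://stackshare.io/jobs', 'Germany')
default_payload = build_payload('https://stackshare.io/jobs')

print(us_payload)
print(default_payload)
```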

    Load more listings

    By default, Stackshare loads a total of 15 job listings, and each time you click the “Load more” button, it loads 15 more job listings:

    [Image: Clicking the "Load more" button]

    With this in mind, you can simulate button clicks by utilizing the API’s Headless Browser. So, let’s add a click instruction to the payload and repeat it 13 times in order to load around 200 job postings:

    payload = {
        'source': 'universal',
        'url': 'https://stackshare.io/jobs',
        'geo_location': 'United States',
        'render': 'html',
        'browser_instructions': [
            {
                'type': 'click',
                'selector': {
                    'type': 'xpath',
                    'value': '//button[contains(text(), "Load more")]'
                }
            },
            {'type': 'wait', 'wait_time_s': 2}
        ] * 13
    }

    You can further scale this instruction by increasing the repetition count to load additional listings.
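To work out the repetition count for a given target, you can derive it from the page size instead of hardcoding it. The following sketch assumes the 15-listings-per-load behavior described above still holds; `clicks_needed` and `load_more_instructions` are hypothetical helper names.

```python
import math

PAGE_SIZE = 15  # Listings shown initially and added per "Load more" click.


def clicks_needed(target_listings, page_size=PAGE_SIZE):
    """Return how many "Load more" clicks are needed to show the target count."""
    if target_listings <= page_size:
        return 0
    return math.ceil((target_listings - page_size) / page_size)


def load_more_instructions(clicks, wait_time_s=2):
    """Build the click-and-wait instruction pair repeated `clicks` times."""
    step = [
        {
            'type': 'click',
            'selector': {
                'type': 'xpath',
                'value': '//button[contains(text(), "Load more")]'
            }
        },
        {'type': 'wait', 'wait_time_s': wait_time_s}
    ]
    return step * clicks


clicks = clicks_needed(200)
instructions = load_more_instructions(clicks)
print(clicks, len(instructions))
```

For a target of 200 listings this gives the same 13 clicks used in the payload above, which loads 15 + 13 × 15 = 210 postings.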

    Fetch a resource

    Instead of crafting selectors for each data point and scraping the HTML file of the Stackshare jobs web page, you can simply fetch all the data from a JSON-formatted resource. This significantly eases the scraping process and allows you to access even more data points that aren’t seen in the HTML file, such as the verified status, precise geolocation data, or job listing IDs.

    You can access the resource in question by visiting the target URL in your browser and then opening the Developer Tools. You can use the following keyboard shortcuts for that:

    • F12 or Control + Shift + I on Windows

    • Command + Option + I on macOS

    Next, head to the Network tab, filter for Fetch/XHR resources, then find the first instance of a resource that starts with query?x-algolia-agent=Algolia, and then open the resource’s Response tab to see job postings in JSON format:

    [Image: Opening the JSON resource via Developer Tools]

    Afterward, open the resource’s Headers tab to see the request URL:

    [Image: Viewing the resource URL via Developer Tools]

    You can instruct the API to access this resource by adding the fetch_resource instruction and specifying a regular expression pattern that matches the resource URL. Note that each click on the “Load more” button triggers a new request to this resource, so the list of resources gains another entry with exactly the same URL. Hence, you want to use the lookahead assertion to match the last occurrence of the resource that starts with query?x-algolia-agent=Algolia, which contains all of the loaded job data. Here’s the complete payload:

    payload = {
        'source': 'universal',
        'url': 'https://stackshare.io/jobs',
        'geo_location': 'United States',
        'render': 'html',
        'browser_instructions': [
            {
                'type': 'click',
                'selector': {
                    'type': 'xpath',
                    'value': '//button[contains(text(), "Load more")]'
                }
            },
            {'type': 'wait', 'wait_time_s': 2}
        ] * 13 + [
            {
                "type": "fetch_resource",
                "filter": "^(?=.*https://km8652f2eg-dsn.algolia.net/1/indexes/Jobs_production/query).*"
            }
        ]
    }
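Before sending the payload, it can help to test the filter pattern locally. The sketch below uses Python's re module, which is an assumption: the API's regex engine may behave differently, and how it chooses among several matching resources is up to the API. The sample URLs are illustrative, not captured traffic.

```python
import re

BASE_URL = 'https://km8652f2eg-dsn.algolia.net/1/indexes/Jobs_production/query'

# The filter exactly as written in the payload.
loose = re.compile(r'^(?=.*' + BASE_URL + r').*')
# A stricter variant that escapes special characters, so "." only matches a dot.
strict = re.compile(r'^(?=.*' + re.escape(BASE_URL) + r').*')

sample_urls = [
    'https://stackshare.io/jobs',
    BASE_URL + '?x-algolia-agent=Algolia',
    'https://km8652f2eg-dsn.algolia.net/1/indexes/Other_index/query',
]

matches = [url for url in sample_urls if loose.match(url)]
print(matches)

# A look-alike host where the dots are replaced by other characters.
lookalike = 'https://km8652f2eg-dsnXalgoliaYnet/1/indexes/Jobs_production/query'
print(bool(loose.match(lookalike)), bool(strict.match(lookalike)))
```

The unescaped pattern is usually good enough here, but escaping the URL avoids accidental matches if you reuse the approach on other targets.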

    4. Send a request to the API

    Afterward, create a response object that sends a POST request to the API while using the API credentials for authentication and passing the payload as a json object:

    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=API_credentials,
        json=payload,
        timeout=180
    )
    results = response.json()['results'][0]['content']
    print(results)
    data = json.loads(results)

    Once the API returns a response, you want to parse it by accessing the scraped content from the results > content keys and then use the json module to properly load the data as a Python dictionary.
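The lines above assume the request succeeds and the content is a JSON string. A more defensive version might look like the sketch below, where `extract_hits` is a hypothetical helper and the fake response only mimics the results > content structure described above.

```python
import json


def extract_hits(api_response):
    """Return the list of job hits from an API response dictionary."""
    results = api_response.get('results') or []
    if not results:
        raise ValueError('The API response contains no results.')
    content = results[0].get('content')
    # The content is expected to be a JSON string; accept a dictionary too.
    if isinstance(content, str):
        content = json.loads(content)
    if not isinstance(content, dict):
        raise ValueError('Unexpected content type in the API response.')
    return content.get('hits', [])


# A made-up response that imitates the structure used in this tutorial.
fake_response = {
    'results': [
        {'content': json.dumps({'hits': [{'title': 'Data Engineer'}]})}
    ]
}
hits = extract_hits(fake_response)
print(hits)
```

With a real request, you could call `response.raise_for_status()` first and then pass `response.json()` to the helper.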

    5. Parse JSON results

    Next, create an empty jobs list and parse the JSON results to retrieve only the data you need. For this, you can use the dictionary .get() method, which returns a default value when a key is missing:

    jobs = []
    for job in data['hits']:
        parsed_job = {
            'Title': job.get('title', ''),
            'Location': job.get('location', ''),
            'Remote': job.get('remote', ''),
            'Company name': job.get('company_name', ''),
            'Company website': job.get('company_website', ''),
            'Verified': job.get('company_verified', ''),
            'Apply URL': job.get('apply_url', '')
        }
        jobs.append(parsed_job)

    Here, you may also provide more instructions in case you want to extract additional information about the tools, precise geolocation, or other details.
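For nested details, chained .get() calls become awkward, so a small lookup helper can keep the parsing readable. In the sketch below, `dig` is a hypothetical helper, and the `_geoloc` and `tools` field names and their shapes are assumptions; inspect the actual JSON response to confirm what the resource returns.

```python
def dig(record, *keys, default=''):
    """Follow a chain of dictionary keys and return default if any is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# A made-up job record; the nested field names are assumptions.
job = {
    'title': 'Data Engineer',
    '_geoloc': {'lat': 40.71, 'lng': -74.0},
    'tools': [{'name': 'Python'}, {'name': 'Airflow'}],
}

parsed_job = {
    'Title': dig(job, 'title'),
    'Latitude': dig(job, '_geoloc', 'lat'),
    'Longitude': dig(job, '_geoloc', 'lng'),
    'Tools': ', '.join(tool.get('name', '') for tool in dig(job, 'tools', default=[])),
}
print(parsed_job)
```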

    6. Save results to a CSV file

    For better readability, you can save the scraped and parsed results to a CSV file that’s also supported by Excel. Use the built-in csv module to achieve this:

    fieldnames = list(jobs[0].keys())
    # newline='' prevents blank rows on Windows; UTF-8 keeps non-ASCII characters intact.
    with open('stackshare_jobs.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(jobs)

    After running the code, you should see a stackshare_jobs.csv file saved in your working directory. Here’s what the job data looks like once you open the file in Excel or Google Sheets:

    [Image: Viewing the saved CSV file via Google Sheets]
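Once the data is in CSV form, it can feed the kind of job trend analysis mentioned at the start of this post. As a minimal sketch, the snippet below counts postings per location with the standard library. The sample rows are made up, and `count_by_column` is a hypothetical helper; to use it on the real file, pass an opened stackshare_jobs.csv instead of the in-memory sample.

```python
import csv
import io
from collections import Counter

# Made-up rows that reuse column names from the parsed jobs.
SAMPLE_CSV = '''Title,Location,Remote
Data Engineer,"New York, NY",True
Backend Developer,"Berlin, Germany",False
ML Engineer,"New York, NY",True
'''


def count_by_column(csv_file, column):
    """Count how often each non-empty value appears in the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader if row.get(column))


location_counts = count_by_column(io.StringIO(SAMPLE_CSV), 'Location')
print(location_counts.most_common(1))
```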

    Full code sample

    Here’s what your Python code file should look like:

    import requests, json, csv
    
    # Use your API username and password.
    API_credentials = ('USERNAME', 'PASSWORD')
    
    # Define your browsing and scraping parameters.
    payload = {
        'source': 'universal',
        'url': 'https://stackshare.io/jobs',
        'geo_location': 'United States',
        'render': 'html',
        'browser_instructions': [
            {
                'type': 'click',
                'selector': {
                    'type': 'xpath',
                    'value': '//button[contains(text(), "Load more")]'
                }
            },
            {'type': 'wait', 'wait_time_s': 2}
        ] * 13 + [
            {
                "type": "fetch_resource",
                "filter": "^(?=.*https://km8652f2eg-dsn.algolia.net/1/indexes/Jobs_production/query).*"
            }
        ]
    }
    
    # Send a request to the API.
    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=API_credentials, # Pass your API credentials.
        json=payload, # Pass the payload.
        timeout=180
    )
    
    # Get the scraped content from the complete response.
    results = response.json()['results'][0]['content']
    print(results)
    data = json.loads(results)
    
    # Parse each job posting and append the results to a list.
    jobs = []
    for job in data['hits']:
        parsed_job = {
            'Title': job.get('title', ''),
            'Location': job.get('location', ''),
            'Remote': job.get('remote', ''),
            'Company name': job.get('company_name', ''),
            'Company website': job.get('company_website', ''),
            'Verified': job.get('company_verified', ''),
            'Apply URL': job.get('apply_url', '')
        }
        jobs.append(parsed_job)
    
    # Create header names from the keys of 'jobs'.
    fieldnames = list(jobs[0].keys())
    # Save the parsed jobs to a CSV file.
    # newline='' prevents blank rows on Windows.
    with open('stackshare_jobs.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(jobs)

    Alternatively, you can use proxy servers instead of relying on a third-party scraping tool. When it comes to fueling your own web crawler, deciding which proxies will work best for you comes next.

    Job scraping with proxies

    If you have your own infrastructure to scrape job postings and want to use proxies for extra help, you should go for Datacenter Proxies or Residential Proxies.

    According to Oxylabs client statistics, datacenter proxies are the most common choice for this use case. Their high speeds and stability make them a go-to option for job scraping.

    We have several blog posts explaining what datacenter proxies are if you'd like to read more, and there's also a video in which Nedas, our Lead of Commercial Product Owners, explains them in simple yet detailed terms.

    Residential proxies are also used when scraping job postings. Since they offer a large IP pool with country- and city-level targeting, they're especially suitable when you need to scrape job listings from targets in very specific geolocations.
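If you route your own scraper through a proxy, the requests library accepts a proxies dictionary. The sketch below only builds that dictionary. The host `proxy.example.com`, the port, and the credentials are placeholders, so replace them with the values from your proxy provider.

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Build a proxies mapping for requests, URL-encoding the credentials."""
    credentials = f'{quote(username, safe="")}:{quote(password, safe="")}'
    proxy_url = f'http://{credentials}@{host}:{port}'
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder values; special characters in the password get percent-encoded.
proxies = build_proxies('USERNAME', 'p@ss:word', 'proxy.example.com', 8080)
print(proxies['https'])

# Usage with requests (not executed here):
# response = requests.get('https://example.com/jobs', proxies=proxies, timeout=30)
```

Encoding the credentials matters because characters such as @ or : in a password would otherwise break the proxy URL.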

    Conclusion

    If you decide to buy a database with the necessary information for your business or you invest in a web scraper from a third party to scrape job postings, you'll save time and money on development and maintenance. However, having your own infrastructure has its benefits. If done right, it can be in the same price range, and you'll have an infrastructure you can completely rely on. Nonetheless, a pre-built scraping infrastructure can be a flexible and powerful choice – see for yourself in this article on how to scrape Google Jobs listings with Python.

    Choosing the right fuel for your web crawler will be the second most important part of this equation, so make sure you invest in a good provider with good knowledge of the market.  

    You can register to get access to residential and datacenter proxies to start job scraping right away or contact our sales team if you have any questions regarding web scraping job postings and their intricacies.

    About the author

    Gabija Fatenaite

    Lead Product Marketing Manager

    Gabija Fatenaite is a Lead Product Marketing Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.

    All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.

    Frequently Asked Questions

    What is job scraping?

    Job scraping is the automated method of collecting job postings from different websites, including such information as a job title, job description, job openings, and other web data regarding job details.

    How does job scraping work?

    Job scraping operates through automated software programs, or bots, that browse job websites and extract data.

    Is web scraping job postings legal?

    The legality of job scraping depends on various factors, including the terms of service of the website being scraped, the jurisdiction in which you operate, and how you use the scraped data.

    Do companies use web scraping?

    Yes, companies often use web scraping for various purposes such as market research, competitive analysis, lead generation, price monitoring, and data aggregation.
