
Web Scraper API Quick Start Guide

Monika Maslauskaite

2023-04-06 · 6 min read

Oxylabs’ Web Scraper API is a data scraper API designed to collect real-time data from websites at scale. This web scraping tool serves as a trustworthy solution for gathering information from complicated targets and makes the crawling process easy. Web Scraper API is best suited for use cases such as website change monitoring, fraud protection, and travel fare monitoring.

Try Web Scraper API right away with our 1-week free trial and start scraping today. Simply go to the Web Scraper API page, register, and get 5,000 free results.

In this guide, we’ll explain how Web Scraper API works and walk you through the process of getting started with this tool without hassle. 

What you get with Web Scraper API

  • Easy integration – smoothly integrate and get raw data from any data point of your chosen target. 

  • Effortless data collection – don’t spend your time on proxy management – we’ll do it for you. 

  • Unlimited scalability – scale your requests freely with a pool of more than 102 million Oxylabs proxies.

  • Enterprise-grade solution – join more than 500 clients who rely on Oxylabs as their primary data provider.

  • 24/7 support – get answers to your questions round-the-clock from our Customer Success team.

What you’ll find on the dashboard

As a Web Scraper API user, you gain access to a convenient dashboard where you can keep an eye on your data usage statistics and track your subscription details. From there, you can also contact our customer service team and get assistance at any time of the day.

Data sources

Web Scraper API will deliver the page’s HTML code from most websites. You can also use JavaScript rendering capabilities to get HTML from websites that utilize JavaScript to load content dynamically.  

Free trial & purchase information

We provide two plans – Regular and Enterprise – each with four subscription options based on the number of results you wish to gather:

Regular:

  1. 1-week Free trial (5,000)

  2. Micro (17,500)

  3. Starter (38,077)

  4. Advanced (103,750)

Enterprise:

  1. Venture (226,818)

  2. Business (525,789)

  3. Corporate (1,250,000)

  4. Custom+ (10M+)

All plans, except for Corporate and Custom+, can be purchased through our self-service dashboard in just a few clicks. To purchase a Corporate or Custom+ plan, please contact our sales team. 

You will also get a Dedicated Account Manager for support when you choose the Business plan or higher. Visit the Web Scraper API pricing page for more detailed information about each plan.

Web Scraper API – how does it work?

After purchasing your desired plan, you can start using Web Scraper API right away. The setup consists of just a few simple steps:

  1. Log in to the dashboard.

  2. Create an API user.

  3. Run a test query and continue setup.

Web Scraper API is an easy-to-use tool that doesn’t require any particular infrastructure or resources on your side.

  1. Choose target links, geo-location, and JS rendering parameters.

  2. Add custom headers and cookies, or let us manage them on our side.

  3. Submit a GET or POST request.

  4. Obtain data via the REST API, either directly or delivered to your cloud storage.

For a visual example of how to use Web Scraper API for public web data scraping, check out the step-by-step tutorial below.

Authentication

Web Scraper API employs basic HTTP authentication, which requires a username and password. This is the easiest way to get started with the tool. The code example below shows how you can send a GET request to https://ip.oxylabs.io using the Realtime delivery method we’ll discuss later in this article.

If you observe low success rates or retrieve empty content, try adding the "render": "html" parameter to your request. More information about the render parameter can be found here.

curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io"}'
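If you prefer Python, the same request can be sent with the popular requests library. The snippet below is a minimal sketch of the cURL call above; USERNAME and PASSWORD are placeholders for your API user credentials.

import requests

# Placeholder credentials for your API user.
USERNAME, PASSWORD = "USERNAME", "PASSWORD"

# The same job payload as in the cURL example above.
payload = {"source": "universal", "url": "https://ip.oxylabs.io"}

# Basic HTTP authentication is passed via the auth tuple.
response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=(USERNAME, PASSWORD),
    json=payload,
)
print(response.json())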

You can try this request and start scraping right away with our free trial. Simply go to the Web Scraper API page and register for a 1-week free trial that offers 5,000 free results.

Integration methods

You can integrate the Web Scraper API using one of the following three methods: Push-Pull, Realtime, and Proxy Endpoint. Let’s take a look at how each method works in more detail. 

Push-Pull

Push-Pull is the most reliable data delivery method. Using this approach, you provide us with your job parameters, and we give you a job id that can be used to get content from the /results endpoint later. You can check whether the job is complete yourself, or set up a listener accepting POST requests, in which case we’d send you a callback message once the job is ready to be retrieved.

In addition, the Push-Pull method offers these possibilities:

  • Single Query. Our endpoint will handle single requests for one keyword or URL. The job id, together with other information, will be sent to you in an API confirmation message. This id will aid you in checking your job status manually. 

  • Check Job Status. If you include callback_url in your query, we’ll provide you with a link to the data once the scraping task is finished. If your query does not include callback_url, you’ll need to check the job status manually, using the URL in href under rel:self in the response message.

  • Retrieve Job Content. As soon as the job content is ready for fetching, you can get it using the URL in href under rel:results.

  • Batch Query. Web Scraper API can execute multiple keywords – up to 1,000 keywords per batch. For this, you’ll have to post the query parameters as data in the JSON body. The system will process every keyword as a separate request and return a unique job id for each one.

  • Get Notifier IP Address List. In order to whitelist the IPs sending you callback messages, you should GET this endpoint.

  • Upload to Storage. The scraped content is stored in our databases by default. Yet, we have a custom storage feature that allows you to store results in your cloud storage so that you wouldn’t need to make any additional requests to fetch results – everything goes directly to your storage. 

  • Callback. We’ll send a callback request to your computer when the data collection task is completed and provide you with a URL to obtain the scraped data (see the listener sketch right after this list).
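To illustrate the Callback option, here is a minimal listener sketch built with Python’s standard library. The port and the printed fields are illustrative assumptions; in practice, the listener must be reachable at the callback_url you submit with your job.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON callback body sent by the notifier.
        length = int(self.headers.get("Content-Length", 0))
        job = json.loads(self.rfile.read(length))
        print("Job finished:", job.get("id"), job.get("status"))
        # Acknowledge receipt with a 200 response.
        self.send_response(200)
        self.end_headers()

# Port 8080 is an arbitrary choice for this sketch.
HTTPServer(("0.0.0.0", 8080), CallbackHandler).serve_forever()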

In this quick start guide, we’ll provide an example of how to interact with Web Scraper API using the Push-Pull integration method and cURL to make requests. We’ll be getting content from a test website, https://ip.oxylabs.io, which returns the IP address from which the request has been made. We’ll be using the United States geo-location.

Example of a single query request:

curl --user "USERNAME:PASSWORD" 'https://data.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io", "geo_location": "United States"}'

Sample of the initial response output:

{
  "callback_url": null,
  "client_id": 1,
  "created_at": "2021-09-30 12:40:32",
  "domain": "io",
  "geo_location": "United States",
  "id": "6849322054852825089",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "parse": false,
  "parser_type": null,
  "render": null,
  "url": "https://ip.oxylabs.io",
  "query": "",
  "source": "universal",
  "start_page": 1,
  "status": "pending",
  "storage_type": null,
  "storage_url": null,
  "subdomain": "ip",
  "content_encoding": "utf-8",
  "updated_at": "2021-09-30 12:40:32",
  "user_agent_type": "desktop",
  "session_info": null,
  "statuses": [],
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/6849322054852825089",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/6849322054852825089/results",
      "method": "GET"
    }
  ]
}

The initial response confirms that the job to scrape the specified website has been created in our system. It also displays all the job parameters, as well as the links for checking whether the job is complete and for downloading the contents.

To check whether the job has reached "status": "done", we can use the link from ["_links"][0]["href"], which is http://data.oxylabs.io/v1/queries/6849322054852825089.

Example of how to check a job status:

curl --user "USERNAME:PASSWORD" 'http://data.oxylabs.io/v1/queries/6849322054852825089'

The response will contain the same data as the initial response. If the job has reached "status": "done", we can retrieve the contents using the link from ["_links"][1]["href"], which is http://data.oxylabs.io/v1/queries/6849322054852825089/results.

Example of how to retrieve data:

curl --user "USERNAME:PASSWORD" 'http://data.oxylabs.io/v1/queries/6849322054852825089/results'

Sample of the response data output:

{
    "results": [
      {
        "content": "24.5.203.132\n", # Actual content from https://ip.oxylabs.io
        "created_at": "2021-09-30 12:40:32",
        "updated_at": "2021-09-30 12:40:35",
        "page": 1,
        "url": "https://ip.oxylabs.io",
        "job_id": "6849322054852825089",
        "status_code": 200
      }
    ]
}
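Putting the three cURL calls above together, the whole Push-Pull cycle can be scripted. The sketch below assumes the requests library and placeholder credentials: it submits a job, polls the rel:self link until the job leaves the "pending" status, and then fetches the content from the rel:results link.

import time
import requests

auth = ("USERNAME", "PASSWORD")  # placeholder credentials

# 1. Submit the job (same payload as the single query example).
job = requests.post(
    "https://data.oxylabs.io/v1/queries",
    auth=auth,
    json={
        "source": "universal",
        "url": "https://ip.oxylabs.io",
        "geo_location": "United States",
    },
).json()

# 2. Poll the rel:self link until the job is done (or faulted).
while job["status"] == "pending":
    time.sleep(5)  # the polling interval is an arbitrary choice
    job = requests.get(job["_links"][0]["href"], auth=auth).json()

# 3. Retrieve the content from the rel:results link.
if job["status"] == "done":
    results = requests.get(job["_links"][1]["href"], auth=auth).json()
    print(results["results"][0]["content"])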

Realtime

With this method, you can send your request and receive data back on the same open HTTPS connection straight away. 

Sample request:

curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io", "geo_location": "United States"}'

Example response body that will be returned on the open connection:

{
    "results": [
      {
        "content": "24.5.203.132\n", # Actual content from https://ip.oxylabs.io
        "created_at": "2021-09-30 12:40:32",
        "updated_at": "2021-09-30 12:40:35",
        "page": 1,
        "url": "https://ip.oxylabs.io",
        "job_id": "6849322054852825089",
        "status_code": 200
      }
    ]
}

Proxy Endpoint

Instead of parameters such as domain and search query, Proxy Endpoint only takes fully formed URLs. That said, you can still send extra information, such as location and language, in the request headers.

Use our entry node as a proxy, authenticate with Web Scraper API credentials, and ignore certificates. Your data will reach you on the same open connection. 

Proxy Endpoint code sample using cURL:

curl -k -x realtime.oxylabs.io:60000 -U USERNAME:PASSWORD -H "X-Oxylabs-Geo-Location: United States" "https://ip.oxylabs.io"

GitHub

Oxylabs GitHub is the place to go for tutorials on how to scrape websites, use our tools, and implement or integrate our products using the most popular programming languages (e.g., C#, Java, Node.js, PHP, Python). Check out the repository on GitHub to find the complete code used in this article.

Parameters*

  • source – Data source.

  • url – Direct URL (link) to the Universal page.

  • user_agent_type – Device type and browser. The full list can be found here. Default: desktop.

  • geo_location – Geo-location of the proxy used to retrieve the data. The full list of supported locations can be found here.

  • locale – Locale, as expected in the Accept-Language header.

  • render – Enables JavaScript rendering. Use it when the target requires JavaScript to load content. Only works via the Push-Pull (a.k.a. Callback) method. Two values are available: html (get raw output) and png (get a Base64-encoded screenshot).

  • content_encoding – Add this parameter if you are downloading images. Learn more here. Default: base64.

  • context:content – Base64-encoded POST request body. Only useful if http_method is set to post.

  • context:cookies – Pass your own cookies.

  • context:follow_redirects – Indicates whether the scraper should follow redirects (3xx responses with a destination URL) to get the contents of the URL at the end of the redirect chain. Default: true.

  • context:headers – Pass your own headers.

  • context:http_method – Set it to post if you would like to make a POST request to your target URL via the Universal scraper. Default: GET.

  • context:session_id – Use this parameter to keep the same proxy across multiple requests. Set your session to any string you like, and we will assign a proxy to this ID and keep it for up to 10 minutes. After that, if you make another request with the same session ID, a new proxy will be assigned to that particular session ID.

  • context:successful_status_codes – Define one or more custom HTTP response codes upon which we should consider the scrape successful and return the content to you. May be useful if you want us to return the 503 error page or in some other non-standard cases.

  • callback_url – URL to your callback endpoint.

  • storage_type – Storage service provider. We support Amazon S3 and Google Cloud Storage; the corresponding storage_type values are s3 and gcs. The full implementation can be found on the Upload to Storage page. Only works via the Push-Pull (Callback) method.

  • storage_url – Your storage bucket name. Only works via the Push-Pull (Callback) method.

*All parameters will be provided after purchasing the product.
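To show how several of these parameters fit together, here is a hypothetical job payload written as a Python dictionary. The target URL and header values are illustrative only, and the context structure is assumed here to be a list of key/value pairs; consult the documentation for the exact format.

payload = {
    "source": "universal",
    "url": "https://example.com/page",  # illustrative target URL
    "user_agent_type": "desktop",
    "geo_location": "United States",
    "render": "html",  # request JavaScript-rendered HTML
    "context": [  # assumed key/value list structure
        {"key": "follow_redirects", "value": True},
        {"key": "headers", "value": {"Accept-Language": "en-US"}},
    ],
}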

Response codes 

  • 204 No content – You are trying to retrieve a job that has not been completed yet.

  • 400 Multiple error messages – Bad request structure; it could be a misspelled parameter or an invalid value. The response body will have a more specific error message.

  • 401 ‘Authorization header not provided’ / ‘Invalid authorization header’ / ‘Client not found’ – Missing authorization header or incorrect login credentials.

  • 403 Forbidden – Your account does not have access to this resource.

  • 404 Not found – The job ID you are looking for is no longer available.

  • 429 Too many requests – Exceeded rate limit. Please contact your account manager to increase limits.

  • 500 Unknown error – Service unavailable.

  • 524 Service unavailable – Service unavailable.

  • 612 Undefined internal error – Something went wrong and we failed the job you submitted. You can try again at no extra cost, as we do not charge you for faulted jobs. If that does not work, please get in touch with us.

  • 613 Faulted after too many retries – We tried scraping the job you submitted but gave up after reaching our retry limit.
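When fetching results programmatically, a few of these codes are worth handling explicitly. The sketch below, which assumes the requests library and a placeholder job ID, retries on 204 (job not completed yet) and 429 (rate limited) and stops on anything else.

import time
import requests

auth = ("USERNAME", "PASSWORD")  # placeholder credentials
results_url = "http://data.oxylabs.io/v1/queries/JOB_ID/results"  # JOB_ID is a placeholder

for attempt in range(10):  # the retry limit is an arbitrary choice
    response = requests.get(results_url, auth=auth)
    if response.status_code == 200:
        print(response.json())
        break
    if response.status_code in (204, 429):
        # Job not finished yet, or rate-limited: wait and retry.
        time.sleep(5)
        continue
    # Any other code (including 612/613): inspect the response and stop.
    print("Failed:", response.status_code, response.text)
    break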

Conclusion

Web Scraper API is a powerful tool that allows you to collect real-time data at scale from nearly any target you need. Several integration methods – Push-Pull, Realtime, and Proxy Endpoint – ensure seamless data delivery. As with any other Oxylabs product, Web Scraper API comes with additional benefits, including an extensive dashboard and round-the-clock customer support.

We hope this guide has made Web Scraper API features easier to understand and covered all the questions surrounding this product. However, if you are still not sure about any aspect of the tool, head over to our documentation for in-depth technical details, visit Oxylabs GitHub or get in touch with us via support@oxylabs.io or the live chat.

About the author

Monika Maslauskaite

Former Content Manager

Monika Maslauskaite is a former Content Manager at Oxylabs. A combination of tech-world and content creation is the thing she is super passionate about in her professional path. While free of work, you’ll find her watching mystery, psychological (basically, all kinds of mind-blowing) movies, dancing, or just making up choreographies in her head.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
