Monika Maslauskaite

Oct 06, 2021 9 min read

Oxylabs’ Web Scraper API is a data scraper API designed to collect real-time data from websites at scale. It serves as a reliable solution for gathering information from complex targets while keeping the crawling process simple. Web Scraper API is best suited for use cases such as website change monitoring, fraud protection, and travel fare monitoring.

In this guide, we’ll explain how Web Scraper API works and walk you through the process of getting started with this tool without hassle. 

What you get with Web Scraper API

  • Easy integration – smoothly integrate and get raw data from any data point of your chosen target. 
  • Effortless data collection – don’t spend your time on proxy management – we’ll do it for you. 
  • Unlimited scalability – send as many requests as you need, backed by a pool of more than 102 million Oxylabs proxies. 
  • Enterprise-grade solution – join more than 500 satisfied clients and rely on Oxylabs as your primary data provider.
  • 24/7 support – immediately get answers to your questions round-the-clock from our Customer Success team. 

What you will find on the dashboard

As a Web Scraper API user, you gain access to a convenient dashboard where you can keep an eye on your data usage statistics and track your subscription details. Not only that – from here, you can contact our customer service team and get assistance at any time of the day. 

Data sources

Web Scraper API will deliver the page’s HTML code from most websites. You can also use JavaScript rendering capabilities to get HTML from websites that utilize JavaScript to load content dynamically.  
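
For example, here is a minimal sketch in Python (assuming the requests library) that submits a Push-Pull job with JavaScript rendering enabled. The render parameter is described in the Parameters table later in this guide; https://example.com stands in for any JavaScript-heavy target, and USERNAME/PASSWORD are placeholders:

import requests

payload = {
    "source": "universal",
    "url": "https://example.com",  # hypothetical JavaScript-heavy target
    "render": "html",              # ask the API to execute JavaScript before returning HTML
}

response = requests.post(
    "https://data.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload,
)
print(response.json())  # job details, including the id and status links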

Web Scraper API – how does it work?

Web Scraper API is an easy-to-use tool that doesn’t require any particular infrastructure or resources on your side. 

  1. Choose target links, geo-location, and JS rendering parameters.
  2. Add custom headers and cookies, or let us manage them on our side.
  3. Submit a GET or POST request.
  4. Obtain data via the REST API, either directly or delivered to your cloud storage.

Authentication

Web Scraper API employs basic HTTP authentication, which requires a username and password. This is the easiest way to get started with the tool. The code example below shows how you can send a GET request to https://ip.oxylabs.io using the Realtime delivery method, which we’ll discuss later in this article:

curl --user "USERNAME:PASSWORD"'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io"}'

Integration methods

You can integrate the Web Scraper API using one of the following three methods: Push-Pull, Realtime, and SuperAPI. Let’s take a look at how each method works in more detail. 

Push-Pull

Push-Pull excels in its simplicity while being the most reliable data delivery method. Using this approach, you provide us with your job parameters and we give you a job id that can be used to fetch content from the /results endpoint at a later point. You can check whether the job is completed yourself, or set up a listener accepting POST requests, in which case we’ll send you a callback message once the job is ready to be retrieved. 

In addition, the Push-Pull method offers these possibilities:

  • Single Query. Our endpoint handles single requests for one keyword or URL. The job id, together with other information, is sent to you in an API confirmation message. This id lets you check your job status manually. 
  • Check Job Status. If you include callback_url in your query, we’ll send you a link to the data once the scraping task is finished. If your query does not include callback_url, you’ll need to check the job status manually by using the URL in href under rel:self in the response message. 
  • Retrieve Job Content. As soon as the job content is ready for fetching, you can get it using the URL in href under rel:results.
  • Batch Query. Web Scraper API can execute multiple keywords – up to 1,000 keywords per batch. For this, you’ll have to post query parameters as data in the JSON body. The system processes every keyword as a separate request and returns a unique job id for each one (see the sketch after this list). 
  • Get Notifier IP Address List. To whitelist the IPs sending you callback messages, send a GET request to this endpoint.
  • Upload to Storage. The scraped content is stored in our databases by default. However, a custom storage feature lets you keep results in your own cloud storage, so you don’t need to make any additional requests to fetch results – everything goes directly to your storage. 
  • Callback. We’ll send a callback request to your machine when the data collection task is completed and provide you with a URL to download the scraped data. 
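
As an illustration of the Batch Query option, here is a minimal sketch in Python (assuming the requests library). The /v1/queries/batch endpoint path and the shape of the response are assumptions based on the description above – check the documentation for the exact schema:

import requests

payload = {
    "source": "universal",
    "url": [                       # up to 1,000 URLs or keywords per batch
        "https://ip.oxylabs.io",
        "https://example.com",     # hypothetical second target
    ],
}

response = requests.post(
    "https://data.oxylabs.io/v1/queries/batch",  # endpoint path is an assumption
    auth=("USERNAME", "PASSWORD"),
    json=payload,
)

# Each URL is processed as a separate job with its own id; the "queries" key
# is an assumption about the response layout.
for job in response.json().get("queries", []):
    print(job["id"], job["status"])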

In this quick start guide, we’ll provide an example of how to interact with Web Scraper API using the Push-Pull integration method and cURL to make requests. We’ll be getting content from a test website, https://ip.oxylabs.io, which returns the IP address from which the request has been made. We’ll be using the United States geo-location.

Example of a single query request:

curl --user "USERNAME:PASSWORD"'https://data.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io", "geo_location": "United States"}'

Sample of the initial response output:

{
  "callback_url": null,
  "client_id": 1,
  "created_at": "2021-09-30 12:40:32",
  "domain": "io",
  "geo_location": "United States",
  "id": "6849322054852825089",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "parse": false,
  "parser_type": null,
  "render": null,
  "url": "https://ip.oxylabs.io",
  "query": "",
  "source": "universal",
  "start_page": 1,
  "status": "pending",
  "storage_type": null,
  "storage_url": null,
  "subdomain": "ip",
  "content_encoding": "utf-8",
  "updated_at": "2021-09-30 12:40:32",
  "user_agent_type": "desktop",
  "session_info": null,
  "statuses": [],
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/6849322054852825089",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/6849322054852825089/results",
      "method": "GET"
    }
  ]
}

The initial response indicates that the job to scrape the specified website has been created in our system. It also displays all the job parameters, along with the links for checking whether the job is complete and for downloading the content.

To check whether the job has "status": "done", we can use the link from ["_links"][0]["href"], which is http://data.oxylabs.io/v1/queries/6849322054852825089.

Example of how to check a job status:

curl --user "USERNAME:PASSWORD"
'http://data.oxylabs.io/v1/queries/6849322054852825089'

The response will contain the same data as the initial response. If the job has "status": "done", we can retrieve the content using the link from ["_links"][1]["href"], which is http://data.oxylabs.io/v1/queries/6849322054852825089/results.

Example of how to retrieve data:

curl --user "USERNAME:PASSWORD"
'http://data.oxylabs.io/v1/queries/6849322054852825089/results'

Sample of the response data output:

{
    "results": [
      {
        "content": "24.5.203.132\n", # Actual content from https://ip.oxylabs.io
        "created_at": "2021-09-30 12:40:32",
        "updated_at": "2021-09-30 12:40:35",
        "page": 1,
        "url": "https://ip.oxylabs.io",
        "job_id": "6849322054852825089",
        "status_code": 200
      }
    ]
}
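
The whole Push-Pull flow is easy to script. Below is a minimal end-to-end sketch in Python (assuming the requests library) that submits a job, polls its status using the rel:self link, and fetches the content from the rel:results link once the job is done. USERNAME and PASSWORD are placeholders:

import time
import requests

AUTH = ("USERNAME", "PASSWORD")

# 1. Submit the job.
job = requests.post(
    "https://data.oxylabs.io/v1/queries",
    auth=AUTH,
    json={"source": "universal", "url": "https://ip.oxylabs.io",
          "geo_location": "United States"},
).json()

status_url = job["_links"][0]["href"]   # rel: self
results_url = job["_links"][1]["href"]  # rel: results

# 2. Poll until the job is done.
while True:
    status = requests.get(status_url, auth=AUTH).json()["status"]
    if status == "done":
        break
    if status == "faulted":
        raise RuntimeError("The job faulted; resubmitting is free of charge.")
    time.sleep(2)  # a fixed delay for simplicity; consider a backoff in production

# 3. Retrieve the content.
results = requests.get(results_url, auth=AUTH).json()
print(results["results"][0]["content"])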

Realtime

With this method, you can send your request and receive data back on the same open HTTPS connection straight away. 

Sample request:

curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal", "url": "https://ip.oxylabs.io", "geo_location": "United States"}'

Example response body that will be returned on the open connection:

{
    "results": [
      {
        "content": "24.5.203.132\n", # Actual content from https://ip.oxylabs.io
        "created_at": "2021-09-30 12:40:32",
        "updated_at": "2021-09-30 12:40:35",
        "page": 1,
        "url": "https://ip.oxylabs.io",
        "job_id": "6849322054852825089",
        "status_code": 200
      }
    ]
}
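
For reference, here is the same Realtime request as a Python sketch (assuming the requests library). Since Realtime holds the connection open until the job finishes, allow a generous timeout; the 180-second value here is an arbitrary assumption:

import requests

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json={"source": "universal", "url": "https://ip.oxylabs.io",
          "geo_location": "United States"},
    timeout=180,  # Realtime waits on the open connection until the job completes
)
print(response.json()["results"][0]["content"])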

SuperAPI

Instead of parameters such as domain and search query, SuperAPI only takes fully formed URLs. That said, you can still send extra information, such as location and language, in the request headers. 

Use our entry node as a proxy, authenticate with Web Scraper API credentials, and ignore certificates. Your data will reach you on the same open connection. 

SuperAPI code sample using cURL:

curl -k -x realtime.oxylabs.io:60000 -U USERNAME:PASSWORD -H "X-Oxylabs-Geo-Location: United States" "https://ip.oxylabs.io"

Parameters*

  • source – Data source.
  • url – Direct URL (link) to the Universal page.
  • user_agent_type – Device type and browser. The full list can be found here. Default: desktop
  • geo_location – Geo-location of the proxy used to retrieve the data. The full list of supported locations can be found here.
  • locale – Locale, as expected in the Accept-Language header.
  • render – Enables JavaScript rendering. Use it when the target requires JavaScript to load content. Only works via the Push-Pull (a.k.a. Callback) method. There are two available values for this parameter: html (get raw output) and png (get a Base64-encoded screenshot).
  • content_encoding – Add this parameter if you are downloading images. Learn more here. Default: base64
  • context: content – Base64-encoded POST request body. It is only useful if http_method is set to post.
  • context: cookies – Pass your own cookies.
  • context: follow_redirects – Indicate whether you would like the scraper to follow redirects (3xx responses with a destination URL) to get the contents of the URL at the end of the redirect chain. Default: true
  • context: headers – Pass your own headers.
  • context: http_method – Set it to post if you would like to make a POST request to your target URL via the Universal scraper. Default: GET
  • context: session_id – If you want to use the same proxy with multiple requests, you can do so by using this parameter. Set your session to any string you like, and we will assign a proxy to this ID and keep it for up to 10 minutes. After that, if you make another request with the same session ID, a new proxy will be assigned to that particular session ID.
  • context: successful_status_codes – Define a custom HTTP response code (or several), upon which we should consider the scrape successful and return the content to you. This may be useful if you want us to return the 503 error page or in other non-standard cases.
  • callback_url – URL to your callback endpoint.
  • storage_type – Storage service provider. We support Amazon S3 and Google Cloud Storage; the storage_type parameter values for these providers are, correspondingly, s3 and gcs. The full implementation can be found on the Upload to Storage page. Only works via the Push-Pull (Callback) method.
  • storage_url – Your storage bucket name. Only works via the Push-Pull (Callback) method.
*All parameters will be provided after purchasing the product.
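
To make the table concrete, here is a sketch of a Push-Pull payload that combines several of the parameters above, including context entries and upload to Amazon S3. The bucket name is a placeholder, and the list-of-key/value shape of context follows the parameter names above – confirm the exact schema in the documentation:

import requests

payload = {
    "source": "universal",
    "url": "https://ip.oxylabs.io",
    "user_agent_type": "desktop",
    "geo_location": "United States",
    "render": "html",
    "context": [
        {"key": "follow_redirects", "value": True},
        {"key": "headers", "value": {"Accept-Language": "en-US"}},
    ],
    "storage_type": "s3",
    "storage_url": "YOUR_BUCKET_NAME",  # placeholder bucket name
}

response = requests.post(
    "https://data.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload,
)
print(response.json()["id"])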

Response codes 

  • 204 – No content. You are trying to retrieve a job that has not been completed yet.
  • 400 – Multiple error messages. Bad request structure; could be a misspelled parameter or an invalid value. The response body will have a more specific error message.
  • 401 – ‘Authorization header not provided’ / ‘Invalid authorization header’ / ‘Client not found’. Missing authorization header or incorrect login credentials.
  • 403 – Forbidden. Your account does not have access to this resource.
  • 404 – Not found. The job ID you are looking for is no longer available.
  • 429 – Too many requests. Rate limit exceeded. Please contact your account manager to increase limits.
  • 500 – Unknown error. Service unavailable.
  • 524 – Service unavailable. Service unavailable.
  • 612 – Undefined internal error. Something went wrong and we failed the job you submitted. You can try again at no extra cost, as we do not charge you for faulted jobs. If that does not work, please get in touch with us.
  • 613 – Faulted after too many retries. We tried scraping the job you submitted but gave up after reaching our retry limit.
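
When fetching results programmatically, these codes map naturally onto a retry loop. Here is a minimal sketch in Python (assuming the requests library), where 204 means the job is not finished yet:

import time
import requests

def fetch_results(results_url, auth, attempts=10):
    for _ in range(attempts):
        response = requests.get(results_url, auth=auth)
        if response.status_code == 200:
            return response.json()          # job done, content ready
        if response.status_code == 204:     # job not completed yet - wait and retry
            time.sleep(3)
            continue
        response.raise_for_status()         # 4xx/5xx: surface the error
    raise TimeoutError("Job did not finish within the retry budget")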


Conclusion

Web Scraper API is a powerful tool that allows you to collect real-time data at scale from nearly any target you need. Several integration methods – Push-Pull, Realtime, and SuperAPI – ensure seamless data delivery. Like any other Oxylabs product, Web Scraper API comes with additional benefits, including an extensive dashboard and round-the-clock customer support. 

We hope this guide has made Web Scraper API features easier to understand and answered the main questions surrounding this product. If you are still unsure about any aspect of the tool, head over to our documentation for in-depth technical details or get in touch with us via support@oxylabs.io or the live chat.

About Monika Maslauskaite

Monika Maslauskaite is a Content Manager at Oxylabs. Combining the tech world with content creation is what she is most passionate about in her professional path. When free of work, you’ll find her watching mystery, psychological (basically, all kinds of mind-blowing) movies, dancing, or just making up choreographies in her head.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
