Real-Time Crawler
Adelina Kiskyte

Jun 19, 2020 7 min read

Oxylabs’ Real-Time Crawler is a data scraper API that provides real-time information from any website. This web crawling tool is a reliable business solution that ensures secure data delivery from your target websites. Real-Time Crawler helps to reduce costs and save human resources while gathering large-scale data.

In this guide, you will learn what you get with Real-Time Crawler, how it works, and what data delivery methods are available. You will also find information about implementing Real-Time Crawler in Python.

What you get with Real-Time Crawler

Real-Time Crawler gathers data from any website and provides raw data with added features. 

  • 100% success rate – only pay for successfully delivered results. 
  • Powered by Next Gen Residential Proxies – fast proxies powered by Artificial Intelligence and Machine Learning based algorithms.
  • Intelligent Proxy Rotator for block management – patented Oxylabs Proxy Rotator ensures that successful results reach you faster.
  • Structured data from leading e-commerce websites and search engines – receive structured data in easy-to-read JSON format.
  • Highly customizable – Real-Time Crawler supports high volumes of requests and allows country- and city-level targeting.
  • Zero maintenance – don’t worry about website changes, forget IP blocks, CAPTCHAs, and proxy management.

What you will find in the dashboard

Figure: Oxylabs' Real-Time Crawler dashboard, where users follow their usage statistics.

Real-Time Crawler users receive access to an easy-to-use dashboard for keeping track of data usage and billing information. You can also get in touch with client support and have your questions answered at any time of the day.

Real-Time Crawler – how does it work?

Real-Time Crawler is easy to use and doesn’t require any specific infrastructure or resources from your end.

  1. Send a request to Real-Time Crawler
  2. Real-Time Crawler collects the required information from your target
  3. Receive ready-to-use web data

Figure: How Real-Time Crawler works. It delivers ready-to-use data from your target website.

Data Extraction Options

Real-Time Crawler offers two data extraction options. Use the data APIs to receive structured data in JSON format from search engines and e-commerce sites. Choose the HTML scraper API to carry out web crawling projects on most other websites and receive the results as HTML.

With both data extraction options, you can choose between Single query and Bulk options. Real-Time Crawler supports both individual keywords and batches of up to 1,000 keywords.
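As a minimal sketch of preparing Bulk queries, the helper below splits a keyword list into batches that respect the 1,000-keyword limit mentioned above. The function name `chunk_keywords` and the constant are illustrative and not part of the Real-Time Crawler API; how each batch is submitted depends on the delivery method you choose.

```python
from typing import Iterator, List

MAX_KEYWORDS_PER_BATCH = 1000  # Per-batch limit stated in this guide.


def chunk_keywords(keywords: List[str],
                   size: int = MAX_KEYWORDS_PER_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of `keywords`, each at most `size` items long."""
    if size < 1:
        raise ValueError('size must be a positive integer')
    for start in range(0, len(keywords), size):
        yield keywords[start:start + size]


if __name__ == '__main__':
    # Hypothetical keyword list, used only to show the batch sizes.
    keywords = ['keyword {}'.format(i) for i in range(2500)]
    print([len(batch) for batch in chunk_keywords(keywords)])
```

With 2,500 keywords, this yields three batches of 1,000, 1,000, and 500 keywords.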

  • Data APIs

Receive structured data in JSON format. The E-commerce API is tailored for accessing data from e-commerce sites. The Search engine API provides structured real-time data from leading search engines. Simply provide your keyword and the domain of your target, then pick a language and a country or city. Extract any data, including product pages, offer listing pages, reviews, and best-selling products.

  • HTML scraper API

Render JavaScript-heavy websites without any IP blocks, CAPTCHAs, or proxy pool management. Provide location- or device-specific requests, and we’ll handle all website changes, proxy rotation, and session control on our end.

Data delivery methods 

Callback

Figure: Real-Time Crawler callback data delivery method. Set up a callback server and receive a notification when your data is ready to be collected.

This is the simplest yet most reliable data delivery method. Simply send a job request, and we’ll send a notification to the callback server once the task is done. You can collect your data within 24-48 hours of receiving the notification.

Callback integration:

import requests
from pprint import pprint
# Structure payload.
payload = {
    'source': 'example_search',
    'domain': 'com',
    'query': 'adidas',
    'callback_url': 'https://your.callback.url',
    'storage_type': 's3',
    'storage_url': 'YOUR_BUCKET_NAME'
}
# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)
# Print prettified response to stdout.
pprint(response.json())

In order to receive callback notifications, set up a server:

# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response

AUTH_TUPLE = ('user', 'pass1')

# Recent Sanic releases require an application name.
app = Sanic('callback_listener')

# Define /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but requests is synchronous. To take full
                # advantage of Sanic, use an async client such as aiohttp.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    return response.json(status=200, body={'status': 'ok'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
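The link lookup inside the listener can be pulled into a small pure function, which makes it testable without a running server. This is a sketch that assumes the callback body has the shape the listener above expects: a `_links` list whose items carry `rel` and `href` keys. The function name `find_results_url` and the sample payload are illustrative.

```python
from typing import Any, Dict, Optional


def find_results_url(payload: Dict[str, Any]) -> Optional[str]:
    """Return the href of the first link whose rel is 'results', if any."""
    for link in payload.get('_links', []):
        if link.get('rel') == 'results':
            return link.get('href')
    return None


if __name__ == '__main__':
    # Hypothetical callback body, shaped like the one the listener reads.
    sample = {
        '_links': [
            {'rel': 'self', 'href': 'https://example.com/job'},
            {'rel': 'results', 'href': 'https://example.com/job/results'},
        ]
    }
    print(find_results_url(sample))
```

Returning `None` when no results link is present lets the listener acknowledge the callback without attempting a download.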

Realtime

Submit your request and get your data back on the same open HTTPS connection in real time.

Sample:

import requests
from pprint import pprint


# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://www.example.com',
    'user_agent_type': 'mobile',
    'render': 'html',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Instead of response with job status and results url, this will return the
# JSON response with results.
pprint(response.json())

Example response body that will be returned on open connection:
{
  "results": [
    {
      "content": "<html> CONTENT </html>",
      "created_at": "2019-10-01 00:00:01",
      "updated_at": "2019-10-01 00:00:15",
      "id": null,
      "page": 1,
      "url": "https://www.example.com/",
      "job_id": "12345678900987654321",
      "status_code": 200
    }
  ]
}
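Once the body arrives, the page content can be pulled out of the `results` list. The sketch below assumes a body with the same fields as the example response above; the helper name `extract_pages` and the shortened sample body are illustrative.

```python
import json
from typing import List, Tuple


def extract_pages(body: str) -> List[Tuple[str, str]]:
    """Return (url, content) pairs for results that report status code 200."""
    data = json.loads(body)
    pages = []
    for result in data.get('results', []):
        if result.get('status_code') == 200:
            pages.append((result['url'], result['content']))
    return pages


if __name__ == '__main__':
    # Shortened sample body using field names from the example response.
    sample_body = '''
    {
      "results": [
        {
          "content": "<html> CONTENT </html>",
          "page": 1,
          "url": "https://www.example.com/",
          "job_id": "12345678900987654321",
          "status_code": 200
        }
      ]
    }
    '''
    for url, content in extract_pages(sample_body):
        print(url, len(content))
```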

SuperAPI

SuperAPI only accepts fully-formed URLs, instead of parameters like domain and search query. That said, you can send additional information such as location and language in the request headers. 

Use our entry node as a proxy, authorize with Real-Time Crawler credentials, and ignore certificates. Your data will reach you on the same open connection.

SuperAPI code sample in Python:

import requests
from pprint import pprint

# Define proxy dict. Don't forget to put your real user and pass here as well.
proxies = {
  'http': 'http://user:[email protected]:60000',
}

response = requests.request(
    'GET',
    'https://www.example.com',
    auth=('user', 'pass1'),
    verify=False,  # Or accept our certificate.
    proxies=proxies,
)

# Print result page to stdout
pprint(response.text)

# Save returned HTML to result.html file
with open('result.html', 'w') as f:
    f.write(response.text)

HTTP

Use Real-Time Crawler in your browser via the One-Liner method. Simply provide us with your target source, the search query, domain, the number of results you want to receive, location, user agent type, and indicate if you would like your data to be parsed.

You’ll then receive a URL that you can paste into your browser and receive either parsed or HTML results. 

If you want to submit your own search engine URLs to Real-Time Crawler, that is also an option. You’ll have to URL-encode the search engine URL you submit and use the “source=example.com” parameter value. This kind of query is easier to implement with the previously described integration methods (Realtime, Callback, or SuperAPI) than in your browser.

Integration example:

https://realtime.oxylabs.io/v1/queries?source=example_search&query=Adidas&domain=com&limit=10&geo_location=Berlin,Germany&user_agent_type=desktop&parse=true&access_token=1234abcd
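A URL like the one above can also be assembled programmatically, so that every parameter value is URL-encoded for you. This is a minimal sketch using only the standard library; the parameter names and the placeholder token are copied from the example URL, while the helper name `build_query_url` is an assumption. Note that `urlencode` percent-encodes the comma in `Berlin,Germany` as `%2C`, which decodes to the same value.

```python
from urllib.parse import urlencode

BASE_URL = 'https://realtime.oxylabs.io/v1/queries'


def build_query_url(params: dict) -> str:
    """Append URL-encoded query parameters to the base endpoint."""
    return '{}?{}'.format(BASE_URL, urlencode(params))


if __name__ == '__main__':
    url = build_query_url({
        'source': 'example_search',
        'query': 'Adidas',
        'domain': 'com',
        'limit': 10,
        'geo_location': 'Berlin,Germany',
        'user_agent_type': 'desktop',
        'parse': 'true',
        'access_token': '1234abcd',  # Placeholder token from the example.
    })
    print(url)
```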

Response codes 


Response | Error message | Description
204 | No content | You are trying to retrieve a job that has not been completed yet
400 | Multiple error messages | Bad request structure, could be a misspelled parameter or invalid value. Response body will have a more specific error message
401 | ‘Authorization header not provided’ / ‘Invalid authorization header’ / ‘Client not found’ | Missing authorization header or incorrect login credentials
403 | Forbidden | Your account does not have access to this resource
404 | Not found | Job ID you are looking for is no longer available
429 | Too many requests | Exceeded rate limit. Please contact your account manager to increase limits
500 | Unknown error | Service unavailable
524 | Timeout | Service unavailable
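One way to act on these codes in client code is to group them by what the caller should do next. The grouping below is an assumption rather than documented Oxylabs behavior: 204, 429, 500, and 524 are treated as worth retrying after a pause, while 400, 401, 403, and 404 are treated as failures that a retry will not fix. The function name `classify_status` is illustrative.

```python
# Status codes come from the table above; the grouping is an assumption.
RETRY_LATER = {204, 429, 500, 524}
DO_NOT_RETRY = {400, 401, 403, 404}


def classify_status(status_code: int) -> str:
    """Map an HTTP status code to 'ok', 'retry', 'fail', or 'unknown'."""
    if status_code == 200:
        return 'ok'
    if status_code in RETRY_LATER:
        return 'retry'
    if status_code in DO_NOT_RETRY:
        return 'fail'
    return 'unknown'


if __name__ == '__main__':
    for code in (200, 204, 401, 429, 524):
        print(code, classify_status(code))
```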

Conclusion

Oxylabs’ Real-Time Crawler is a web crawling tool that extracts data from any website. It’s an easy-to-use tool that only delivers successful results. Data scraping API and HTML API take care of CAPTCHAs and proxy rotation, allowing you to focus on working with ready-to-use, fresh data.

Real-Time Crawler offers four different delivery methods and is easy to integrate. All Real-Time Crawler users receive access to the client dashboard and extensive documentation. If you’re interested in seeing Real-Time Crawler in action, start your free trial now!

About Adelina Kiskyte

Adelina Kiskyte is a Content Manager at Oxylabs. Adelina constantly follows tech news and loves trying out new apps, even the most useless. When she is not glued to her phone, she also enjoys reading self-motivation books and biographies of tech-inspired innovators. Who knows, maybe one day she will create a life-changing app of her own!

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.