
Oxylabs’ Real-Time Crawler is a data scraper API that provides real-time information from any website. This web crawling tool is a reliable business solution that ensures secure data delivery from your target websites. Real-Time Crawler helps reduce costs and save human resources when gathering data at scale.
In this guide, you will learn what you get with Real-Time Crawler, how it works, and what data delivery methods are available. You will also find information about implementing Real-Time Crawler in Python programming language.
Navigation:
- What you get with Real-Time Crawler
- What you will find in the dashboard
- Real-Time Crawler – how does it work?
- Data extraction options
- Data delivery methods
- Response codes
- Conclusion
What you get with Real-Time Crawler
Real-Time Crawler gathers data from any website and provides raw data with added features.
- 100% success rate – only pay for successfully delivered results.
- Powered by Next Gen Residential Proxies – fast proxies driven by Artificial Intelligence and Machine Learning based algorithms.
- Intelligent Proxy Rotator for block management – patented Oxylabs Proxy Rotator ensures that successful results reach you faster.
- Structured data from leading e-commerce websites and search engines – receive structured data in easy-to-read JSON format.
- Highly customizable – Real-Time Crawler supports high volumes of requests and allows country- and city-level targeting.
- Zero maintenance – no need to worry about website changes, IP blocks, CAPTCHAs, or proxy management.
What you will find in the dashboard
Real-Time Crawler users receive access to an easy-to-use dashboard, where you can keep track of your data usage and follow your billing information. You can easily get in touch with client support and get your questions answered at any time of the day.
Real-Time Crawler – how does it work?
Real-Time Crawler is easy to use and doesn’t require any specific infrastructure or resources from your end.
- Send a request to Real-Time Crawler
- Real-Time Crawler collects the required information from your target
- Receive ready-to-use web data
Data extraction options
Real-Time Crawler offers two data extraction options. Use data APIs to receive structured data in JSON format from search engines and e-commerce sites. Choose HTML scraper API to carry out web crawling projects for most websites in HTML.
With both data extraction options, you can choose between Single query and Bulk options. Real-Time Crawler can execute individual queries as well as batches of up to 1,000 keywords at a time, as sketched below.
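Here is a minimal sketch of a bulk submission. It assumes a batch endpoint that accepts the same payload as the single-query endpoint shown later in this guide, with a list of keywords in the query field; verify the exact path and payload shape against the documentation:
import requests
from pprint import pprint

# One payload, many keywords (up to 1,000 per batch).
payload = {
    'source': 'example_search',
    'domain': 'com',
    'query': ['adidas', 'nike', 'puma'],  # A list instead of a single string.
    'callback_url': 'https://your.callback.url',
}

response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries/batch',  # Assumed batch endpoint.
    auth=('user', 'pass1'),
    json=payload,
)
pprint(response.json())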
- Data APIs
Receive structured data in JSON format. E-commerce API is tailored for accessing data from e-commerce sites, while Search engine API provides structured real-time data from leading search engines. Simply provide your keyword and target domain, then pick a language and a country or city. You can extract any data, including product pages, offer listing pages, reviews, and best-selling products (see the sketch after this list).
- HTML scraper API
Render JavaScript-heavy websites without any IP blocks, CAPTCHAs, or proxy pool management. Provide location- or device-specific requests, and we’ll handle all website changes, proxy rotation, and session control on our end.
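To illustrate a structured-data request, here is a minimal sketch that reuses parameters appearing elsewhere in this guide (geo_location and parse from the One-Liner example, the endpoint from the Realtime example); treat it as a sketch rather than a definitive reference:
import requests
from pprint import pprint

# Request parsed (structured) JSON from the search engine data API.
payload = {
    'source': 'example_search',
    'domain': 'com',
    'query': 'adidas',
    'geo_location': 'Berlin,Germany',  # Country- or city-level targeting.
    'parse': True,  # Return structured JSON instead of raw HTML.
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)
pprint(response.json())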
Data delivery methods
Callback
This is the simplest yet most reliable data delivery method. Simply send a job request, and we’ll send a notification to your callback server once the task is done. You can then collect your data at any point within the next 24-48 hours.
Callback integration:
import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'example_search',
    'domain': 'com',
    'query': 'adidas',
    'callback_url': 'https://your.callback.url',
    'storage_type': 's3',
    'storage_url': 'YOUR_BUCKET_NAME'
}

# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())
To receive callback notifications, set up a listening server:
# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response

AUTH_TUPLE = ('user', 'pass1')

app = Sanic('job_listener')

# Define a /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but the requests library is synchronous; to
                # fully take advantage of Sanic, use aiohttp instead.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    return response.json(status=200, body={'status': 'ok'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Realtime
Submit your request and get your data back on the same open HTTPS connection in real time.
Sample:
import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://www.example.com',
    'user_agent_type': 'mobile',
    'render': 'html',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Instead of a response with job status and results url, this will return the
# JSON response with results.
pprint(response.json())
Example response body that will be returned on the open connection:
{
    "results": [
        {
            "content": "<html> CONTENT </html>",
            "created_at": "2019-10-01 00:00:01",
            "updated_at": "2019-10-01 00:00:15",
            "id": null,
            "page": 1,
            "url": "https://www.example.com/",
            "job_id": "12345678900987654321",
            "status_code": 200
        }
    ]
}
SuperAPI
SuperAPI accepts only fully-formed URLs, rather than parameters such as domain and search query. That said, you can send additional information, such as location and language, in the request headers (see the sketch after the code sample below).
Use our entry node as a proxy, authorize with your Real-Time Crawler credentials, and ignore certificate verification. Your data will reach you on the same open connection.
SuperAPI code sample in Python programming language:
import requests
from pprint import pprint

# Define the proxy dict. Don't forget to put your real user and pass here as well.
# The target URL is HTTPS, so an 'https' entry is needed for the proxy to be used.
proxies = {
    'http': 'http://user:[email protected]:60000',
    'https': 'http://user:[email protected]:60000',
}

response = requests.request(
    'GET',
    'https://www.example.com',
    auth=('user', 'pass1'),
    verify=False,  # Or accept our certificate.
    proxies=proxies,
)

# Print the result page to stdout.
pprint(response.text)

# Save the returned HTML to a result.html file.
with open('result.html', 'w') as f:
    f.write(response.text)
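As mentioned above, SuperAPI can also take location- and device-specific instructions via request headers. The sketch below shows the idea; the header names here are assumptions for illustration, so check the documentation for the exact names SuperAPI supports:
import requests

proxies = {
    'http': 'http://user:[email protected]:60000',
    'https': 'http://user:[email protected]:60000',
}

# Hypothetical header names - verify against the documentation.
headers = {
    'X-Oxylabs-Geo-Location': 'Berlin,Germany',
    'X-Oxylabs-User-Agent-Type': 'mobile',
}

response = requests.get(
    'https://www.example.com',
    headers=headers,
    proxies=proxies,
    verify=False,  # Or accept our certificate.
)
print(response.status_code)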
HTTP
Use Real-Time Crawler in your browser via the One-Liner method. Simply provide us with your target source, the search query, domain, the number of results you want to receive, location, user agent type, and indicate if you would like your data to be parsed.
You’ll then receive a URL that you can paste into your browser and receive either parsed or HTML results.
If you want to submit your own search engine URLs to Real-Time Crawler, that is also an option, but you’ll have to URL-encode the search engine URL you submit and use the “source=example.com” parameter value. This kind of query is easier to implement with the other integration methods described above – Realtime, Callback, or SuperAPI – than in your browser.
Integration example:
https://realtime.oxylabs.io/v1/queries?source=example_search&query=Adidas&domain=com&limit=10&geo_location=Berlin,Germany&user_agent_type=desktop&parse=true&access_token=1234abcd
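For reference, here is a small sketch of assembling the same One-Liner URL in Python, including URL-encoding a raw search engine URL for the “source=example.com” style of query described above; the access token is a placeholder:
from urllib.parse import urlencode, quote

# Assemble the One-Liner URL from the parameters in the example above.
params = {
    'source': 'example_search',
    'query': 'Adidas',
    'domain': 'com',
    'limit': 10,
    'geo_location': 'Berlin,Germany',
    'user_agent_type': 'desktop',
    'parse': 'true',
    'access_token': '1234abcd',  # Placeholder token.
}
one_liner = 'https://realtime.oxylabs.io/v1/queries?' + urlencode(params)
print(one_liner)

# Submitting your own search engine URL instead: URL-encode it first.
raw_url = 'https://www.example.com/search?q=adidas'
print(quote(raw_url, safe=''))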
Response codes
Response | Error message | Description |
204 | No content | You are trying to retrieve a job that has not been completed yet |
400 | Multiple error messages | Bad request structure, could be a misspelled parameter or invalid value. Response body will have a more specific error message |
401 | ‘Authorization header not provided’ / ‘Invalid authorization header’ / ‘Client not found’ | Missing authorization header or incorrect login credentials |
403 | Forbidden | Your account does not have access to this resource |
404 | Not found | Job ID you are looking for is no longer available |
429 | Too many requests | Exceeded rate limit. Please contact your account manager to increase limits |
500 | Unknown error | Service unavailable |
524 | Timeout | Service unavailable |
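As a minimal sketch of handling these codes in client code – polling a job’s results URL (the ‘_links’ structure from the callback example above), retrying on 204 and backing off on 429 – assuming the same authentication as the earlier examples:
import time
import requests

AUTH_TUPLE = ('user', 'pass1')

def fetch_results(results_url, retries=10, delay=5):
    for _ in range(retries):
        response = requests.get(results_url, auth=AUTH_TUPLE)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 204:
            time.sleep(delay)            # Job not completed yet; poll again.
        elif response.status_code == 429:
            time.sleep(delay * 2)        # Rate limited; back off harder.
        else:
            response.raise_for_status()  # Other 4xx/5xx: surface the error.
    raise TimeoutError('Job did not complete in time')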
Conclusion
Oxylabs’ Real-Time Crawler is a web crawling tool that extracts data from any website. It’s an easy-to-use tool that only delivers successful results. Data scraping API and HTML API take care of CAPTCHAs and proxy rotation, allowing you to focus on working with ready-to-use, fresh data.
Real-Time Crawler offers four different delivery methods and is easy to integrate. All Real-Time Crawler users get access to the client dashboard and extensive documentation. If you’re interested in seeing Real-Time Crawler in action, start your free trial now!