
Oxylabs’ Real-Time Crawler is a data scraper API that provides real-time information from any website. This web crawling tool is a reliable business solution that ensures secure data delivery from your target websites. Real-Time Crawler helps reduce costs and save human resources when gathering data at scale.
In this guide, you will learn what you get with Real-Time Crawler, how it works, and what data delivery methods are available. You will also find information about implementing Real-Time Crawler in Python programming language.
Navigation:
- What you get with Real-Time Crawler
- What you will find in the dashboard
- Real-Time Crawler – how does it work?
- Data extraction options
- Data delivery methods
- Response codes
- Conclusion
What you get with Real-Time Crawler
Real-Time Crawler gathers data from any website and provides raw data with added features.
- 100% success rate – only pay for successfully delivered results.
- Powered by Next Gen Residential Proxies – fast proxies driven by Artificial Intelligence and Machine Learning based algorithms.
- Intelligent Proxy Rotator for block management – patented Oxylabs Proxy Rotator ensures that successful results reach you faster.
- Structured data from leading e-commerce websites and search engines – receive structured data in easy-to-read JSON format.
- Highly customizable – Real-Time Crawler supports high volumes of requests and allows country- and city-level targeting.
- Zero maintenance – no need to worry about website changes, IP blocks, CAPTCHAs, or proxy management.
What you will find in the dashboard
Real-Time Crawler users receive access to an easy-to-use dashboard, where you can keep track of your data usage and follow your billing information. You can easily get in touch with client support and get your questions answered at any time of the day.
Real-Time Crawler – how does it work?
Real-Time Crawler is easy to use and doesn’t require any specific infrastructure or resources from your end.
- Send a request to Real-Time Crawler
- Real-Time Crawler collects the required information from your target
- Receive ready-to-use web data
Data extraction options
Real-Time Crawler offers two data extraction options. Use data APIs to receive structured data in JSON format from search engines and e-commerce sites. Choose HTML scraper API to carry out web crawling projects for most websites in HTML.
With both data extraction options, you can choose between Single query and Bulk options. Real-Time Crawler can execute individual queries as well as batches of up to 1,000 keywords at a time, as sketched below.
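Here is a minimal sketch of a bulk submission. It assumes a batch endpoint that accepts the same payload as the single-query endpoint shown later in this guide, with a list of keywords in the query field; verify the exact path and payload shape against the documentation:
import requests
from pprint import pprint

# One payload, many keywords (up to 1,000 per batch).
payload = {
    'source': 'example_search',
    'domain': 'com',
    'query': ['adidas', 'nike', 'puma'],  # A list instead of a single string.
    'callback_url': 'https://your.callback.url',
}

response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries/batch',  # Assumed batch endpoint.
    auth=('user', 'pass1'),
    json=payload,
)
pprint(response.json())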
- Data APIs
Receive structured data in JSON format. E-commerce API is tailored for accessing data from e-commerce sites, while Search engine API provides structured real-time data from leading search engines. Simply provide your keyword and target domain, then pick a language and a country or city. You can extract any data, including product pages, offer listing pages, reviews, and best-selling products (see the sketch after this list).
- HTML scraper API
Render JavaScript-heavy websites without any IP blocks, CAPTCHAs, or proxy pool management. Provide location- or device-specific requests, and we’ll handle all website changes, proxy rotation, and session control on our end.
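To illustrate a structured-data request, here is a minimal sketch that reuses parameters appearing elsewhere in this guide (geo_location and parse from the One-Liner example, the endpoint from the Realtime example); treat it as a sketch rather than a definitive reference:
import requests
from pprint import pprint

# Request parsed (structured) JSON from the search engine data API.
payload = {
    'source': 'example_search',
    'domain': 'com',
    'query': 'adidas',
    'geo_location': 'Berlin,Germany',  # Country- or city-level targeting.
    'parse': True,  # Return structured JSON instead of raw HTML.
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)
pprint(response.json())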
Data delivery methods
Callback
This is the simplest yet most reliable data delivery method. Simply send a job request, and we’ll send a notification to your callback server once the task is done. You can then collect your data at any point within the next 24-48 hours.
Callback integration:
import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'example_search',
    'domain': 'com',
    'query': 'adidas',
    'callback_url': 'https://your.callback.url',
    'storage_type': 's3',
    'storage_url': 'YOUR_BUCKET_NAME'
}

# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())
To receive callback notifications, set up a listening server:
# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response

AUTH_TUPLE = ('user', 'pass1')

app = Sanic('job_listener')

# Define a /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but the requests library is synchronous; to
                # fully take advantage of Sanic, use aiohttp instead.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    return response.json(status=200, body={'status': 'ok'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Realtime
Submit your request and get your data back on the same open HTTPS connection in real time.
Sample:
import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://www.example.com',
    'user_agent_type': 'mobile',
    'render': 'html',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Instead of a response with job status and results url, this will return the
# JSON response with results.
pprint(response.json())
Example response body that will be returned on the open connection:
{
    "results": [
        {
            "content": "<html> CONTENT </html>",
            "created_at": "2019-10-01 00:00:01",
            "updated_at": "2019-10-01 00:00:15",
            "id": null,
            "page": 1,
            "url": "https://www.example.com/",
            "job_id": "12345678900987654321",
            "status_code": 200
        }
    ]
}
SuperAPI
SuperAPI accepts only fully-formed URLs, rather than parameters such as domain and search query. That said, you can send additional information, such as location and language, in the request headers (see the sketch after the code sample below).
Use our entry node as a proxy, authorize with your Real-Time Crawler credentials, and ignore certificate verification. Your data will reach you on the same open connection.
SuperAPI code sample in Python programming language:
import requests
from pprint import pprint

# Define the proxy dict. Don't forget to put your real user and pass here as well.
# The target URL is HTTPS, so an 'https' entry is needed for the proxy to be used.
proxies = {
    'http': 'http://user:[email protected]:60000',
    'https': 'http://user:[email protected]:60000',
}

response = requests.request(
    'GET',
    'https://www.example.com',
    auth=('user', 'pass1'),
    verify=False,  # Or accept our certificate.
    proxies=proxies,
)

# Print the result page to stdout.
pprint(response.text)

# Save the returned HTML to a result.html file.
with open('result.html', 'w') as f:
    f.write(response.text)
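As mentioned above, SuperAPI can also take location- and device-specific instructions via request headers. The sketch below shows the idea; the header names here are assumptions for illustration, so check the documentation for the exact names SuperAPI supports:
import requests

proxies = {
    'http': 'http://user:[email protected]:60000',
    'https': 'http://user:[email protected]:60000',
}

# Hypothetical header names - verify against the documentation.
headers = {
    'X-Oxylabs-Geo-Location': 'Berlin,Germany',
    'X-Oxylabs-User-Agent-Type': 'mobile',
}

response = requests.get(
    'https://www.example.com',
    headers=headers,
    proxies=proxies,
    verify=False,  # Or accept our certificate.
)
print(response.status_code)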
HTTP
Use Real-Time Crawler in your browser via the One-Liner method. Simply provide us with your target source, the search query, domain, the number of results you want to receive, location, user agent type, and indicate if you would like your data to be parsed.
You’ll then receive a URL that you can paste into your browser and receive either parsed or HTML results.
If you want to submit your own search engine URLs to Real-Time Crawler, that is also an option, but you’ll have to URL-encode the search engine URL you submit and use the “source=example.com” parameter value. This kind of query is easier to implement with the other integration methods described above – Realtime, Callback, or SuperAPI – than in your browser.
Integration example:
https://realtime.oxylabs.io/v1/queries?source=example_search&query=Adidas&domain=com&limit=10&geo_location=Berlin,Germany&user_agent_type=desktop&parse=true&access_token=1234abcd
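For reference, here is a small sketch of assembling the same One-Liner URL in Python, including URL-encoding a raw search engine URL for the “source=example.com” style of query described above; the access token is a placeholder:
from urllib.parse import urlencode, quote

# Assemble the One-Liner URL from the parameters in the example above.
params = {
    'source': 'example_search',
    'query': 'Adidas',
    'domain': 'com',
    'limit': 10,
    'geo_location': 'Berlin,Germany',
    'user_agent_type': 'desktop',
    'parse': 'true',
    'access_token': '1234abcd',  # Placeholder token.
}
one_liner = 'https://realtime.oxylabs.io/v1/queries?' + urlencode(params)
print(one_liner)

# Submitting your own search engine URL instead: URL-encode it first.
raw_url = 'https://www.example.com/search?q=adidas'
print(quote(raw_url, safe=''))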
Response codes
Response | Error message | Description |
204 | No content | You are trying to retrieve a job that has not been completed yet |
400 | Multiple error messages | Bad request structure, could be a misspelled parameter or invalid value. Response body will have a more specific error message |
401 | ‘Authorization header not provided’ / ‘Invalid authorization header’ / ‘Client not found’ | Missing authorization header or incorrect login credentials |
403 | Forbidden | Your account does not have access to this resource |
404 | Not found | Job ID you are looking for is no longer available |
429 | Too many requests | Exceeded rate limit. Please contact your account manager to increase limits |
500 | Unknown error | Service unavailable |
524 | Timeout | Service unavailable |
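As a minimal sketch of handling these codes in client code – polling a job’s results URL (the ‘_links’ structure from the callback example above), retrying on 204 and backing off on 429 – assuming the same authentication as the earlier examples:
import time
import requests

AUTH_TUPLE = ('user', 'pass1')

def fetch_results(results_url, retries=10, delay=5):
    for _ in range(retries):
        response = requests.get(results_url, auth=AUTH_TUPLE)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 204:
            time.sleep(delay)            # Job not completed yet; poll again.
        elif response.status_code == 429:
            time.sleep(delay * 2)        # Rate limited; back off harder.
        else:
            response.raise_for_status()  # Other 4xx/5xx: surface the error.
    raise TimeoutError('Job did not complete in time')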
Conclusion
Oxylabs’ Real-Time Crawler is a web crawling tool that extracts data from any website. It’s an easy-to-use tool that only delivers successful results. Data scraping API and HTML API take care of CAPTCHAs and proxy rotation, allowing you to focus on working with ready-to-use, fresh data.
Real-Time Crawler offers four different delivery methods and is easy to integrate. All Real-Time Crawler users get access to the client dashboard and extensive documentation. If you’re interested in seeing Real-Time Crawler in action, start your free trial now!