OxyCon, Oxylabs’ very first annual web data harvesting conference, was packed with in-depth talks and workshops. On the second day of the event, Eivydas Vilcinskas, Software Engineer at Oxylabs, took the stage to share some tactical advice on how to reach a 100% success rate using Oxylabs Real-Time Crawler.
According to Eivydas, 99.7% of the time, Real-Time Crawler successfully delivers data. However, as with all services, there is always a small chance of system downtime. Fortunately, in his workshop, Eivydas walked through the possible issues and explained how to solve each of them.
How to access Oxylabs’ web crawling tool, Real-Time Crawler
Before we go into error codes, let’s quickly recap how you could access the service of Real-Time Crawler.
Via SuperAPI
SuperAPI acts as a standard proxy with some added functionality and returns the body of the response from the target verbatim. We don’t wrap it in any structures like JSON or add any additional data.
Via real-time data delivery method
When using the real-time data delivery method, you POST the job, and Real-Time Crawler returns the requested data on an open connection. If done correctly, the response should come back with the HTTP status code 200 and should contain a JSON object with the data you requested.
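To make the real-time flow concrete, here is a minimal Python sketch of posting a job and checking the response. The endpoint URL, the payload fields, and the use of HTTP Basic authentication are assumptions made for illustration only; take the real values from the Oxylabs documentation.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint, used only for illustration.
ENDPOINT = "https://rtc.example.invalid/v1/queries"


def build_job_request(payload, username, password, endpoint=ENDPOINT):
    """Build (but do not send) a POST request carrying the job as JSON."""
    # Basic auth is an assumption; the post only says credentials are required.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def read_job_response(status, body):
    """Return the decoded JSON on HTTP 200, otherwise raise with the body."""
    if status != 200:
        raise RuntimeError(f"job failed with HTTP {status}: {body[:200]!r}")
    return json.loads(body)
```

The request can then be sent with `urllib.request.urlopen(request, timeout=120)`, and the response's `status` and `read()` result passed to `read_job_response`. Keep in mind that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so those need their own `except` branch.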
Via callback data delivery method
The callback method allows you to decide when to retrieve the requested data (but no later than 24 hours after submitting the job) and lets you manage the full range of options as well as request/response timings. Check our previous blog post on callback vs. real-time data delivery methods to learn more.
Scraping the web with Real-Time Crawler: error types
According to Eivydas, the majority of errors that you might encounter while integrating or using Real-Time Crawler fall into three categories: request, response, and content.
Request errors are related to the request path and usually arise when the signal doesn’t reach the intended destination. It might mean that the Real-Time Crawler servers are physically not reachable over the network and/or the services are not running correctly.
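For request errors, the troubleshooting table later in this post suggests waiting a few minutes before retrying and escalating if the server is still down after 5 minutes. A minimal Python sketch of that policy could look like the one below. The helper name `call_with_retry`, the 60-second pause, and the `send` callable are hypothetical choices, not part of any Oxylabs client.

```python
import time


def call_with_retry(send, max_wait_s=300, pause_s=60, sleep=time.sleep):
    """Call send() until it succeeds or max_wait_s of pauses have elapsed.

    send is any zero-argument callable that raises OSError when the
    server cannot be reached (urllib's URLError is a subclass of OSError).
    """
    waited = 0
    while True:
        try:
            return send()
        except OSError:
            if waited >= max_wait_s:
                raise  # still unreachable after the wait budget: escalate
            sleep(pause_s)
            waited += pause_s
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without actually waiting.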
A response error usually means that the network is running smoothly and the services are available to return something, so the issue is most probably related to the way the request is made or the data it contains. For example, you might be using the wrong HTTP method for contacting the service endpoint, or requesting data that we cannot provide.
Content errors surface after you get the data from Real-Time Crawler and start processing it further. We report the job as completed successfully, but during your own data analysis you find that the data is not exactly what you requested, or that it has some flaws.
The most common errors and how to solve them
Depending on which type of access method you are using and which kind of error you get, there might be different ways to solve an issue. We summarized all of them in the classy table down below.
| Access type | Error type | Error code | Solution | Plan B |
|---|---|---|---|---|
| SuperAPI / Real-Time / Callback | Request | Servers are not reachable | Wait a few minutes before retrying | If the server is still down after 5 minutes, contact your account manager |
| SuperAPI / Real-Time / Callback | Request | Real-Time Crawler is not reachable | Check if you’re not hitting the wrong endpoint | Troubleshoot your connection. If the connection is ok, contact Oxylabs |
| SuperAPI / Real-Time / Callback | Response | 400 | Look for the message in the body to see the reason | – |
| Real-Time / Callback | Response | 401 | Check if you’re using correct credentials or if your user wasn’t disabled. To fix this, contact your account manager | You might get this error code if the source that you’ve given to Real-Time Crawler is not supported or disabled for you. Contact your account manager to discuss implementing the necessary source in our system or to have it enabled for you. |
| SuperAPI / Real-Time / Callback | Response | 404 | Check the documentation for correct endpoints | – |
| Real-Time / Callback | Response | 405 | Use POST to submit your jobs. Any other HTTP method will return a response with this status code | – |
| SuperAPI | Response | 407 | Check if you’re not using incorrect credentials | If you want to reset your credentials, contact your account manager |
| SuperAPI / Real-Time | Response | 408 | Increase the default timeout value to 120 s and try again | If timeout comes from Real-Time Crawler side, contact Oxylabs |
| Callback | Response | 408 | Increase the default timeout value to 30 s and try again | If timeout comes from Real-Time Crawler side, contact Oxylabs |
| SuperAPI / Real-Time / Callback | Response | 429 | You have reached the limit of requests per week/month/etc. Reach out to your account manager to increase the limit | You might be making too many requests per minute. Contact your account manager to increase this limit |
| Real-Time / Callback | Response | 5xx | If this error appears for more than 5 minutes, contact Oxylabs | – |
| SuperAPI / Real-Time / Callback | Content | Status code is not 200 | Retry the job | The reason for this error might be incorrect job parameters. This status code should be handled from your side, but you can always contact Oxylabs for assistance |
| SuperAPI / Real-Time / Callback | Content | 200 but data for given parameters are incorrect | Target website might have changed their algorithms. There might be a way to get the required data by using different parameters. Contact Oxylabs for assistance | – |
| SuperAPI / Real-Time / Callback | Content | Corrupted content | Have you decoded or decompressed the data? If yes and the data is still corrupted, contact Oxylabs | – |
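If you want to act on these status codes in your own integration, the table can be condensed into a small lookup. The Python sketch below is only an illustration: the function name `next_step`, the lowercase access type labels, and the shortened advice strings are our own assumptions, and it covers just the first suggested action for each code.

```python
def next_step(access_type, status):
    """Map an HTTP status code to the first suggested action from the table."""
    if status == 200:
        return "inspect the content"
    if status == 408:
        # Table: 120 s for SuperAPI / Real-Time, 30 s for Callback.
        timeout = 30 if access_type == "callback" else 120
        return f"raise timeout to {timeout} s and retry"
    if 500 <= status <= 599:
        return "contact Oxylabs if it lasts over 5 minutes"
    fixes = {
        400: "read the message in the response body",
        401: "check credentials",
        404: "check the endpoint against the documentation",
        405: "submit the job with POST",
        407: "check credentials",
        429: "request limit reached, contact your account manager",
    }
    # Any other non-200 status falls under the "Retry the job" row.
    return fixes.get(status, "retry the job")
```

For example, `next_step("callback", 408)` returns the 30-second timeout advice, while the same status for Real-Time or SuperAPI returns the 120-second variant.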
Structured data errors
While accessing Real-Time Crawler via the callback data delivery method, you might get structured data errors. According to Eivydas, the results for structured data contain two separate status codes. The one in the root of the object contains the status code of the HTTP response that the target has given us. The other one, in content.parse_status_code, marks the status of the parsing efforts:
- 12000 – parse successful. The content should be pristine, contain all the fields that are expected to be parsed.
- 12004 – parse successful with errors. There might be some fields that were not parsed correctly somewhere in the tree. Such fields contain the text “Could not parse
- 12003 – we could not parse the content because it is not supported. For now, only the target responses of 200 are attempted to be parsed.
- 12002 – failed the attempt to parse. A real failure on our side. This might be caused by a significant change in the HTML structure, and we have to adapt our code to continue parsing the format.
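The two status codes can be read from a result with a few lines of Python. This is a sketch under assumptions: the post names content.parse_status_code, but the root key `status_code` and the helper `describe_result` are hypothetical names chosen for illustration.

```python
PARSE_STATUS = {
    12000: "parsed successfully",
    12004: "parsed with errors in some fields",
    12003: "not parsed: content not supported",
    12002: "parse attempt failed",
}


def describe_result(result):
    """Return (target HTTP status, parse status description) for one result."""
    # "status_code" as the root key is an assumption, not confirmed by the post.
    target_status = result.get("status_code")
    content = result.get("content")
    # Unparsed results may carry raw text instead of a structured object.
    parse_status = content.get("parse_status_code") if isinstance(content, dict) else None
    return target_status, PARSE_STATUS.get(parse_status, "unknown parse status")
```

A result with a target status of 200 and a parse status of 12004 would be a candidate for checking individual fields before using the data.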
So, we’ve covered the most common errors which you can encounter while integrating and using Real-Time Crawler. Reaching 100% success rate should be a piece of cake now! Moreover, detailed documentation on what Eivydas has covered in his workshop is currently in the works and will be published on our website soon. In the meantime, if you have any questions regarding Real-Time Crawler, feel free to contact us.