Scraping the Web With 100% Success Rate by Eivydas Vilcinskas

Gabija Fatenaite

Oct 10, 2019 6 min read

OxyCon, Oxylabs’ very first annual web data harvesting conference, was packed with in-depth talks and workshops. On the second day of the event, Eivydas Vilcinskas, Software Engineer at Oxylabs, took the stage to share some tactical advice on how to reach a 100% success rate using Oxylabs Real-Time Crawler.

According to Eivydas, Real-Time Crawler successfully delivers data 99.7% of the time. However, as with all services, there is always that remaining 0.3% chance of system downtime. Fortunately, in his workshop, Eivydas walked through all possible issues and explained how to solve each and every one of them.

How to access Oxylabs’ web crawling tool, Real-Time Crawler

Before we go into error codes, let’s quickly recap the ways you can access Real-Time Crawler.

Using SuperAPI

This method is the simplest (but also the most limited) way of accessing Real-Time Crawler. Apart from providing the target URL, you can only provide headers to select the wanted User-Agent-Type and Geo-Location to spoof. SuperAPI also doesn’t support our JavaScript rendering service or submitting jobs in batches.

SuperAPI acts as a standard proxy with some added functionality and returns the body of the response from the target verbatim. We don’t wrap it in any structures like JSON or add any additional data.
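Since SuperAPI behaves like a standard proxy configured per request through headers, a call can be sketched as below. The hostname, port, and header names here are assumptions made for illustration (they are not stated in this article), so check the official documentation for the exact values:

```python
# Sketch of a SuperAPI request. The proxy address and the
# X-Oxylabs-* header names are assumptions for illustration only.
def build_superapi_request(username: str, password: str,
                           user_agent_type: str = "desktop",
                           geo_location: str = "United States"):
    """Return the proxy and header dicts for a SuperAPI-style request."""
    proxy = f"http://{username}:{password}@realtime.example.io:60000"
    proxies = {"http": proxy, "https": proxy}
    headers = {
        # The only per-request options are headers selecting the
        # User-Agent type and the location to spoof:
        "X-Oxylabs-User-Agent-Type": user_agent_type,
        "X-Oxylabs-Geo-Location": geo_location,
    }
    return proxies, headers


proxies, headers = build_superapi_request("user", "pass")
# The response body is the target page verbatim -- no JSON wrapper:
#   response = requests.get("https://example.com",
#                           proxies=proxies, headers=headers)
```

Because the body comes back verbatim, there is nothing extra to unwrap: the proxy simply relays whatever the target returned.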

Via real-time data delivery method

When using the real-time data delivery method, you POST the job, and Real-Time Crawler returns the requested data on an open connection. If done correctly, the data should come back with the HTTP status code 200 and should contain a JSON with the data you requested.
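The flow above can be sketched as a small payload builder plus a POST on an open connection. The endpoint URL and the payload field names (`source`, `url`) are assumptions for illustration; consult the documentation for the exact schema:

```python
import json


def build_realtime_job(url: str, source: str = "universal") -> dict:
    """Build the JSON payload POSTed to the real-time endpoint.
    Field names are illustrative assumptions, not the documented schema."""
    return {"source": source, "url": url}


payload = build_realtime_job("https://example.com")
assert json.dumps(payload)  # payload is JSON-serializable

# With the payload in hand, the call itself would look roughly like:
#   response = requests.post("https://realtime.example.io/v1/queries",
#                            auth=("user", "pass"), json=payload)
#   if response.status_code == 200:
#       data = response.json()  # the requested data, wrapped in JSON
```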

Via callback data delivery method

The callback method allows you to decide when to retrieve the requested data (within 24 hours) and lets you manage the full range of options as well as request/response timing. Check our previous blog post on callback vs. real-time data delivery methods to learn more.
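A callback-style submission differs from the real-time one mainly in that the job payload carries a URL we should notify when the data is ready. The field name `callback_url` is an assumption for illustration:

```python
def build_callback_job(url: str, callback_url: str,
                       source: str = "universal") -> dict:
    """Job payload asking the service to notify callback_url when done.
    Field names are illustrative assumptions, not the documented schema."""
    return {
        "source": source,
        "url": url,
        # You then fetch the finished result at your convenience,
        # no later than 24 hours after being notified:
        "callback_url": callback_url,
    }
```

The design advantage is decoupling: your side does not hold a connection open while the job runs, and you control when the result is actually downloaded.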

Scraping the web with Real-Time Crawler: error types

According to Eivydas, the majority of errors that you might encounter while integrating or using Real-Time Crawler fall into three categories: request, response, and content.

Request errors

These types of errors are related to the request path and usually arise when the signal doesn’t reach the intended destination. It might mean that the Real-Time Crawler servers are physically not reachable over the network and/or the services are not running correctly.
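Because request errors are transient by nature (the recommended first response, per the table below, is simply to wait and retry), they are a natural fit for a retry loop with growing delays. This is a generic sketch, not part of any Oxylabs client library:

```python
import time


def retry_with_backoff(fetch, attempts: int = 5, base_delay: float = 1.0):
    """Call fetch(); on a connection error, wait with exponentially
    growing delays and retry. Re-raise after the last attempt -- at
    that point it is time to contact support."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```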

Response errors

Usually, a response error means that the network is running smoothly and the services are available to return something; the issue is most probably related to the way the request is made and the data it contains. For example, you might be using the wrong HTTP method for the service endpoint, or requesting data that we cannot provide.

Content errors

Once you get data from Real-Time Crawler, you can process it further. We report the job as completed successfully, but during data analysis on your part, you may find that the data is not exactly what you requested, or that it has some flaws.

The most common errors and how to solve them

Depending on which access method you are using and which kind of error you get, there may be different ways to solve the issue. We summarized all of them in the table below.

| Access type | Error type | Error code | Solution | Plan B |
|---|---|---|---|---|
| SuperAPI / Real-Time / Callback | Request | Servers are not reachable | Wait a few minutes before retrying | If the server is still down after 5 minutes, contact your account manager |
| SuperAPI / Real-Time / Callback | Request | Real-Time Crawler is not reachable | Check that you’re not hitting the wrong endpoint | Troubleshoot your connection. If the connection is OK, contact Oxylabs |
| SuperAPI / Real-Time / Callback | Response | 400 | Look for the message in the body to see the reason | |
| Real-Time / Callback | Response | 401 | Check that you’re using correct credentials and that your user wasn’t disabled. To fix this, contact your account manager | You might also get this code if the source you’ve given to Real-Time Crawler is not supported or is disabled for you. Contact your account manager to have the source implemented or enabled |
| SuperAPI / Real-Time / Callback | Response | 404 | Check the documentation for correct endpoints | |
| Real-Time / Callback | Response | 405 | Use POST to submit your jobs. Any other HTTP method returns a response with this status code | |
| SuperAPI | Response | 407 | Check that your credentials are correct | To reset your credentials, contact your account manager |
| SuperAPI / Real-Time | Response | 408 | Increase the default timeout value to 120 s and try again | If the timeout comes from the Real-Time Crawler side, contact Oxylabs |
| Callback | Response | 408 | Increase the default timeout value to 30 s and try again | If the timeout comes from the Real-Time Crawler side, contact Oxylabs |
| SuperAPI / Real-Time / Callback | Response | 429 | You have reached the limit of requests per week/month/etc. Reach out to your account manager to increase the limit | You might be making too many requests per minute. Contact your account manager to increase this limit |
| Real-Time / Callback | Response | 5xx | If this error persists for more than 5 minutes, contact Oxylabs | |
| SuperAPI / Real-Time / Callback | Content | Status code is not 200 | Retry the job | The reason might be incorrect job parameters. This status code should be handled on your side, but you can always contact Oxylabs for assistance |
| SuperAPI / Real-Time / Callback | Content | 200, but data for the given parameters is incorrect | The target website might have changed its algorithms. There might be a way to get the required data by using different parameters. Contact Oxylabs for assistance | |
| SuperAPI / Real-Time / Callback | Content | Corrupted content | Have you decoded or decompressed the data? If yes and the data is still corrupted, contact Oxylabs | |
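The response-error rows of the table lend themselves to a small lookup that maps a status code to the suggested first action. This is a convenience sketch of the table above, not part of any official client:

```python
# First-action advice per status code, condensed from the table above.
ADVICE = {
    400: "Check the message in the response body for the reason",
    401: "Verify credentials and that the source is enabled for your user",
    404: "Check the documentation for correct endpoints",
    405: "Submit jobs with POST",
    407: "Verify your proxy credentials",
    408: "Increase the timeout value and retry",
    429: "Request limit reached -- ask your account manager to raise it",
}


def advise(status_code: int) -> str:
    """Return the table's suggested first action for a response error."""
    if 500 <= status_code < 600:
        return "Server-side error -- contact Oxylabs if it persists over 5 minutes"
    return ADVICE.get(status_code, "Unexpected code -- consult the documentation")
```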

Structured data errors

While accessing Real-Time Crawler via the callback data delivery method, you might get structured data errors. According to Eivydas, results for structured data contain two separate status codes. The one in the root of the object holds the status code of the HTTP response that the target gave us. The other, content.parse_status_code, marks the status of the parsing effort:

  • 12000 – parse successful. The content should be pristine and contain all the fields that are expected to be parsed.
  • 12004 – parse successful with errors. Some fields somewhere in the tree may not have been parsed correctly; such fields contain the text “Could not parse xyz: ReasonForFailure”.
  • 12003 – we could not parse the content because it is not supported. For now, we only attempt to parse target responses with status 200.
  • 12002 – the parse attempt failed: a real failure on our side. This might be caused by a significant change in the HTML structure, and we have to adapt our code to continue parsing the format.
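The two-level scheme above can be checked in a few lines. The JSON shape used here (a root `status_code` field plus `content.parse_status_code`) is inferred from the description and should be treated as an assumption:

```python
# Parse status codes as described in the article.
PARSE_OK = 12000
PARSE_OK_WITH_ERRORS = 12004
PARSE_NOT_SUPPORTED = 12003
PARSE_FAILED = 12002


def check_result(result: dict) -> str:
    """Classify a structured-data result by its two status codes.
    The root field name 'status_code' is an assumed name."""
    if result.get("status_code") != 200:
        return "target error"        # the target itself did not return 200
    parse_code = result.get("content", {}).get("parse_status_code")
    if parse_code == PARSE_OK:
        return "ok"
    if parse_code == PARSE_OK_WITH_ERRORS:
        return "ok, inspect fields"  # look for "Could not parse ..." values
    if parse_code == PARSE_NOT_SUPPORTED:
        return "not supported"
    return "parse failed"            # 12002: report to Oxylabs
```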

Wrapping up

So, we’ve covered the most common errors you can encounter while integrating and using Real-Time Crawler. Reaching a 100% success rate should be a piece of cake now! Moreover, detailed documentation covering what Eivydas presented in his workshop is currently in the works and will be published on our website soon. In the meantime, if you have any questions about Real-Time Crawler, feel free to contact us.


About Gabija Fatenaite

Gabija Fatenaite is a Product Marketing Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
