Data is the new king of this era, which is why companies prioritize building infrastructure to collect web data. However, maintaining such infrastructure is expensive, so many look for a more cost-efficient solution, such as Scraper APIs.
When starting with a scraping service, companies also get to choose the integration method best suited to their requirements. In this article, you’ll learn about the three primary Scraper API integration methods, their key advantages, and some tips.
Let’s start by defining integration. In web scraping, integration is the process of incorporating a web scraping service into the pipeline that a company uses to scrape and parse web data. An integration method, then, is how a company interacts with the scraping software and gets the data delivered.
All integration methods fall under two major categories – synchronous and asynchronous methods. To explain it better, let’s draw an analogy with a phone call service.
Imagine you give us a call asking to scrape some web data. We hear your request and start processing the job. Once the job is complete, say 10 seconds later, we return the results on the same call and then hang up. This is an example of a synchronous method. In this case, you must keep the connection open after you send your job request so that we can return the results when the job is done. Sending a request and getting the results happen on the same connection.
The asynchronous integration method works differently. Continuing the analogy with the call service, you make a short phone call to us, describing the job that needs to be done. We immediately reply, confirming that the job is registered under a particular job ID. The call ends here, and the connection is closed.
When the job is complete, we make another short call to the phone number you left while registering, letting you know that the job is done. Then, you can download the results. Alternatively, you can check the job status yourself by calling us again. The critical point here is that sending a job request and downloading the results happen on separate connections, allowing you to use the line for other calls in the meantime.
The terms "integration method" and "data delivery method" are largely interchangeable. Strictly speaking, though, data delivery is the narrower term: it covers only how the results reach you. With a synchronous method, delivery is built into the request itself, while an asynchronous method also lets you download the data on your own initiative.
You can integrate Oxylabs Scraper APIs in three ways: Realtime, Proxy Endpoint, and Push-Pull.
Realtime is a synchronous method. Here are its key characteristics:
You have to keep the connection open from the moment you submit your job until we return either the results of a successfully completed job or an error;
To submit a request, you need to provide a JSON payload that describes the job, including fields such as the source, geolocation, and query.
Tip: The JSON needed for the job submission in Realtime is unique to our service and requires additional knowledge. Read our documentation to learn more.
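As a rough illustration of the Realtime flow, the Python sketch below builds a JSON job description and sends it over a single blocking HTTP request. The endpoint URL, the use of HTTP basic authentication, and the exact field names (`source`, `query`, `geo_location`) are assumptions made for this example; the documentation defines the real values.

```python
import base64
import json
import urllib.request

# Placeholder endpoint, assumed for illustration only.
REALTIME_URL = "https://scraper-api.example.com/v1/queries"


def build_job(source, query, geo_location=None):
    """Return a job description as a dict; the field names are assumptions."""
    job = {"source": source, "query": query}
    if geo_location is not None:
        job["geo_location"] = geo_location
    return job


def build_request(job, username, password, url=REALTIME_URL):
    """Prepare a POST request carrying the job as JSON (basic auth assumed)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(job).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def scrape_realtime(job, username, password, timeout=120):
    """Send the job and block until the results (or an error) come back."""
    request = build_request(job, username, password)
    # The connection stays open for the whole duration of the job.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The generous `timeout` reflects the synchronous nature of the method: the client has to wait on the open connection for as long as the job takes.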
Similarly to Realtime, the Proxy Endpoint integration method is synchronous and, for the most part, works in the same way. The biggest difference is how you provide the initial job request information. Here’s what you need to know about this integration method:
With this method, you use our Scraper API endpoint as a proxy server;
Unlike Realtime, instead of a JSON, you provide a URL along with any headers or cookies you would like to pass on to the target you want to scrape.
Tip: If you’re already familiar with proxies and how they work, you might want to opt for this integration method due to its convenience and ease. All you need to do is change the proxy address you use, and we’ll do the rest. As a result, this method requires the fewest changes on the client’s side.
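A minimal sketch of the Proxy Endpoint idea in Python, using only the standard library: the client keeps requesting the target URL as usual and only swaps in a different proxy address. The proxy host, port, and credential scheme here are placeholders assumed for illustration, not the service's actual values.

```python
import urllib.request
from urllib.parse import quote

# Placeholder proxy host and port, assumed for illustration only.
PROXY_HOST = "scraper-proxy.example.com"
PROXY_PORT = 60000


def proxy_url(username, password, host=PROXY_HOST, port=PROXY_PORT):
    """Build a proxy URL with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


def make_proxy_opener(username, password):
    """Return an opener that routes HTTP and HTTPS traffic via the proxy."""
    address = proxy_url(username, password)
    handler = urllib.request.ProxyHandler({"http": address, "https": address})
    return urllib.request.build_opener(handler)


def scrape_via_proxy(target_url, username, password, headers=None, timeout=120):
    """Fetch target_url through the proxy, passing optional custom headers."""
    opener = make_proxy_opener(username, password)
    request = urllib.request.Request(target_url, headers=headers or {})
    # As with Realtime, the connection stays open until the response arrives.
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Because the job is expressed as an ordinary request for the target URL, existing proxy-based code usually needs little more than a new proxy address.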
Unlike Realtime and Proxy Endpoint, Push-Pull is an asynchronous integration method. Let’s take a look at some of its main features:
With the Push-Pull data delivery method, you don’t have to keep an open connection. Instead, once the job is complete, you can retrieve the results at your convenience. In addition to downloading the results, it’s possible to check the job status;
Similar to Realtime, you need to submit a JSON with the job description;
If you include a callback URL in your job description, once the job is done, we’ll post a message to your callback server saying that the job has been completed, and then you can go ahead and download the results within the next 24 hours.
Compared to other integration methods, Push-Pull is a bit more complex. However, it offers richer functionality.
With a single query, you submit one job per request. Once your job is submitted, we give you your job ID and a couple of URLs: one where you can check your job status and another where you can download the results. You can use whichever of the two suits you.
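The status-check side of this flow could be sketched roughly as follows. The shape of the status document (a `status` field that becomes `done` or `faulted`) is an assumption made for the example, and authentication is left out for brevity. The fetch and sleep functions are injectable so the polling logic can be exercised without a network.

```python
import json
import time
import urllib.request


def get_json(url, timeout=30):
    """Fetch a URL and decode its JSON body (authentication omitted)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_job(status_url, fetch=get_json, interval=5.0, max_attempts=60,
                 sleep=time.sleep):
    """Poll status_url until the job finishes and return the final status.

    Assumes a "status" field that becomes "done" on success or "faulted"
    on failure; both values are hypothetical.
    """
    for _ in range(max_attempts):
        info = fetch(status_url)  # each check is a short, separate connection
        if info.get("status") == "done":
            return info
        if info.get("status") == "faulted":
            raise RuntimeError(f"job failed: {info}")
        sleep(interval)
    raise TimeoutError(f"job not finished after {max_attempts} checks")


def download_results(results_url, fetch=get_json):
    """Download the results on a fresh connection once the job is done."""
    return fetch(results_url)
```

Between checks, the client holds no open connection, which is the practical difference from the synchronous methods.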
A batch query lets you send a single Push-Pull request with up to 1,000 scraping jobs you’d like to execute. In response, you’ll get one job ID per submitted job. The jobs are processed independently of one another.
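If you have more jobs than fit into one batch, you would need to split them on the client side. The sketch below shows only that splitting step in Python; the batch payload format itself is not covered here, and the job dictionaries are hypothetical.

```python
MAX_JOBS_PER_BATCH = 1000  # per-request limit mentioned above


def chunk_jobs(jobs, size=MAX_JOBS_PER_BATCH):
    """Split a list of job descriptions into batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [jobs[i:i + size] for i in range(0, len(jobs), size)]


# Hypothetical job descriptions; the field names are assumptions.
jobs = [{"source": "example_source", "query": f"item {n}"} for n in range(2500)]
batches = chunk_jobs(jobs)
batch_sizes = [len(batch) for batch in batches]
```

Each batch can then be submitted as its own request, and the returned job IDs tracked individually.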
In most cases, when you provide a callback URL, you don’t want it to be accessible to anyone on the internet. You only want your scraping service provider to access it to send the information about the job completion. For this, you need a list of IP addresses you would accept connections from. You can get the IP addresses of our notifier services and add them to your allow-list.
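One way to enforce such an allow-list on your own callback server is sketched below in Python. The networks listed are documentation-only example ranges standing in for the notifier addresses, and the handler is a bare-bones illustration rather than a production setup.

```python
import ipaddress
from http.server import BaseHTTPRequestHandler

# Hypothetical notifier addresses; replace with the provider's actual list.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_allowed(remote_ip, networks=ALLOWED_NETWORKS):
    """Return True if remote_ip belongs to any allow-listed network."""
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # not a valid IP address at all
    return any(address in network for network in networks)


class CallbackHandler(BaseHTTPRequestHandler):
    """Accept job-completion callbacks only from allow-listed addresses."""

    def do_POST(self):
        if not is_allowed(self.client_address[0]):
            self.send_error(403, "Forbidden")
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)  # the job-completion message
        # Processing of `body` (e.g. scheduling a download) would go here.
        self.send_response(200)
        self.end_headers()
```

If the callback server sits behind a reverse proxy or load balancer, `client_address` will be the proxy's address, so the check would have to move to that layer instead.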
If you want the results to be uploaded to your storage, specify the delivery destination in your job description. This way, you don’t need to actively download any data, which makes the process easier and more efficient.
Tip: Push-Pull is the better option if you need to scale up and fetch large volumes of data.
Now, let’s outline the similarities and differences between the methods and what they have to offer.
| |Push-Pull|Realtime|Proxy Endpoint|
|---|---|---|---|
|Job query format|JSON|JSON|URL|
|Job status check|Yes|No|No|
|Upload to storage|Yes|No|No|
As you can see, each integration method has its peculiarities and advantages.
In general, synchronous methods (Realtime and Proxy Endpoint) are easier to implement and use; however, the need to keep the connection open for the entire duration of each job consumes more infrastructure resources.
On the other hand, Push-Pull is harder to implement but offers better scalability and more helpful features. This method doesn’t require many infrastructure resources, but you would need to invest some engineering resources at the implementation stage.
When choosing which integration method to use with our Scraper APIs, keep in mind that these methods vary in ease of implementation and functionality. If you don’t mind going the hard way, consider Push-Pull – it’s perfect for scaling up and offers useful features, such as job status checks and storage delivery. If you want to avoid the hassle, you can pick Realtime or Proxy Endpoint.
To delve deeper into the technical details of each integration method, visit our documentation.
About the author
Maryia Stsiopkina is a Content Manager at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. At leisure, she does birdwatching with binoculars (some people mistake it for stalking), makes flower jewelry, and eats pickles.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.