
Web Scraping SDK: Definition and Benefits at a Glance

Vytenis Kaubrė

2024-08-14 · 5 min read

Building software quickly and efficiently without introducing major pitfalls is no picnic. That's why modern engineering leverages Software Development Kits (SDKs), which include everything needed to simplify and accelerate development for specific platforms, operating systems, or frameworks. With pre-built functionality and streamlined handling of repetitive coding tasks, programmers can focus on what truly matters rather than reinventing the wheel.

In this article, you’ll learn about the intricacies of SDKs and their role in web scraping. The second part focuses on getting started with Oxylabs SDKs and a side-by-side comparison with Oxylabs' Web Scraper API to showcase SDK development advantages.

What is an SDK?

An SDK is a set of tools, libraries, debuggers, code samples, documentation, and pre-written code wrappers that provide shortcuts for development. Instead of designing the logic, mastering every underlying tool and library, and writing all the code yourself, you get most of it under the hood. An SDK abstracts away much of the low-level coding and lets developers focus on higher-level functionality.

Commonly, SDKs include APIs for integration and development tools like compilers and debuggers, ensuring consistency across applications. As such, SDKs enable overall efficiency and effectiveness in rapidly building applications, in turn reducing development time and associated costs.

As an example, think of iOS SDKs, which provide easy access to Apple’s operating system functionality for building iOS applications. These SDKs standardize the feel and behavior of the apps themselves, ensuring a consistent user experience across the platform.

The role of SDKs in web scraping

Custom scraper vs. API

Before discussing web scraping SDKs, let’s briefly overview two common web scraping approaches:

  • Custom web scraper: building a custom scraper requires expert knowledge of libraries, error handling, overcoming anti-scraping challenges (IP blocks, CAPTCHAs, fingerprinting), handling dynamic content, data parsing, and scaling operations (see the brief sketch after this list). It’s a highly flexible approach that lets developers build a public data gathering tool with all their needs in mind, but it requires professional programming skills and continuous maintenance.

  • Web scraper API: a scraper API is specifically designed to eliminate typical scraping complexities by providing a dedicated tool that bypasses anti-scraping measures, parses the data, scales easily, and requires no maintenance. It’s a reliable interface that allows developers to focus on managing and utilizing the data instead of dealing with difficult websites and maintaining the underlying infrastructure.
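
To make the contrast concrete, below is a minimal sketch of the custom-scraper route using the well-known requests and BeautifulSoup libraries. The target URL and CSS selector are purely illustrative, and a production scraper would also need proxy rotation, retries, and anti-bot handling:

    import requests
    from bs4 import BeautifulSoup

    # Illustrative target and selector only; a real scraper would also need
    # proxy rotation, retry logic, and anti-bot countermeasures.
    url = "https://example.com/products"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper)"}

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    # Parse the HTML and collect product titles from a hypothetical listing page.
    soup = BeautifulSoup(response.text, "html.parser")
    product_titles = [tag.get_text(strip=True) for tag in soup.select("h2.product-title")]
    print(product_titles)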

SDK vs. API

Oftentimes, web scraping SDKs take over the handling of APIs and streamline the entire coding workflow. Such SDKs abstract away certain API difficulties and offer the following conveniences, illustrated by the short sketch after this list:

  • HTTP(S) requests: an SDK handles connection tasks like sending HTTP requests using synchronous and asynchronous methods and manages sessions, cookies, and authentication. Additionally, with an SDK, users don’t have to manage different API endpoint addresses.

  • Responses: SDKs can also simplify response processing by automatically validating responses, handling status codes, and dealing with data structures.

  • Errors and logging: robust error handling and logging techniques are usually included with SDKs, allowing programmers to debug and monitor their activities with meaningful error messages without requiring custom coding.

  • Scalability: while scraper APIs already provide a way to easily scale operations, SDKs simplify it further by managing requests and offering shortcuts.
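
Here's a minimal, assumption-based sketch of what this abstraction looks like in practice: a raw call to Oxylabs' Realtime endpoint with the requests library next to the equivalent SDK call. The RealtimeClient and the serp.google.scrape_search() method follow the SDK table later in this article, so treat the snippet as an outline rather than a definitive implementation:

    import requests
    from oxylabs import RealtimeClient

    query = "adidas hoodie women"

    # Raw API call: you construct the endpoint, payload, auth, and timeout yourself.
    raw_response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=("YOUR_API_USERNAME", "YOUR_API_PASSWORD"),
        json={"source": "google_search", "query": query},
        timeout=60,
    )
    raw_response.raise_for_status()
    print(raw_response.json())

    # SDK call: the client wraps the endpoint, authentication, and response handling.
    client = RealtimeClient("YOUR_API_USERNAME", "YOUR_API_PASSWORD")
    response = client.serp.google.scrape_search(query)
    print(response.raw)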

With all this in mind, SDKs grant the following benefits over using APIs directly:

  • Faster development, cutting out repetitive code and library usage by replacing it with simple and easy-to-use functions.

  • Fewer errors due to streamlined development processes, ensuring there’s not much room for user error.

  • Better performance of the end product, as SDKs are built with best practices and performance in mind.

Introducing Oxylabs SDKs

We’ve designed Oxylabs SDKs for Python and Go, allowing you to streamline your integration process with Oxylabs SERP API and E-Commerce API. In turn, you can take advantage of these three key benefits:

  • User-friendly interface for easy interaction with the APIs.

  • Automated handling of API requests and responses.

  • Easier troubleshooting with common API error management and clear error messages.

Refer to the next section for a quick start with the SDK in Python and a direct comparison to the Oxylabs API that achieves equivalent results.

Getting started with Oxylabs Python SDK

Installing the SDK

Our SDK requires Python 3.5 or above. To install Oxylabs SDK, run the following pip command in your terminal:

pip install oxylabs
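
If you'd like to confirm the package is importable, a quick sanity check from the terminal works; assuming a standard package layout, it prints the install path, which will differ per environment:

python -c "import oxylabs; print(oxylabs.__file__)"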

Accessing SDK documentation

While Oxylabs SDK documentation is available on GitHub, you may want to access it in a more readable format. For this, you can use the pdoc library to open the SDK documentation in your browser via localhost. Begin by installing the library using pip:

pip install pdoc

Next, run this line in your terminal:

pdoc -h localhost -p 8080 oxylabs

This will open a new browser window where you can navigate the documentation via http://localhost:8080:

Navigating Oxylabs SDK documentation via localhost

1. Choose the integration method

With Oxylabs SDK, you can easily import and access synchronous and asynchronous clients for calling the API endpoints:

  • Realtime (Sync) - RealtimeClient(username, password)

  • Push-Pull (Async) - AsyncClient(username, password)

  • Proxy Endpoint (Sync) - ProxyClient(username, password)

For more information, see this direct comparison of the integration methods and visit our documentation.
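
For reference, here's how each client could be instantiated, assuming all three classes are importable from the top-level oxylabs package, as AsyncClient is in the steps below:

    from oxylabs import RealtimeClient, AsyncClient, ProxyClient

    # All three clients take the same Oxylabs API credentials; pick the one
    # that matches your preferred integration method.
    realtime = RealtimeClient("YOUR_API_USERNAME", "YOUR_API_PASSWORD")    # Realtime (sync)
    push_pull = AsyncClient("YOUR_API_USERNAME", "YOUR_API_PASSWORD")      # Push-Pull (async)
    proxy = ProxyClient("YOUR_API_USERNAME", "YOUR_API_PASSWORD")          # Proxy Endpoint (sync)

The rest of this guide uses the asynchronous Push-Pull client.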

Get a free trial

Claim your 1-week free trial to test Oxylabs' Web Scraper API for your project needs.

  • 5K requests for free
  • No credit card required

As an example, let’s create an asynchronous scraper that communicates with Oxylabs E-Commerce API (now part of a Web Scraper API solution) via the Push-Pull method. Begin by importing the asyncio and json libraries along with the AsyncClient class:

    import asyncio, json
    from oxylabs import AsyncClient

asyncio will be used to execute the tasks asynchronously, while the built-in json library will serialize Python objects into JSON-formatted files.

2. Initialize the client

Next, initialize the client with your Oxylabs API user credentials and wrap it in an asynchronous main() function:

    async def main():
        c = AsyncClient("YOUR_API_USERNAME", "YOUR_API_PASSWORD")
        # Remaining code...
    
        
    if __name__ == "__main__":
        asyncio.run(main())

3. Define your targets and parameters

The SDK offers simplified functionality for the following API sources: Google, Bing, Google Shopping, Amazon, Wayfair, and universal e-commerce. See the table below for how to access each source and its available scraping methods in Python:

Source object | Available methods
serp.google | scrape_ads, scrape_hotels, scrape_images, scrape_search, scrape_suggestions, scrape_travel_hotels, scrape_trends_explore, scrape_url
serp.bing | scrape_search, scrape_url
ecommerce.google_shopping | scrape_product_pricing, scrape_shopping_products, scrape_shopping_search, scrape_shopping_url
ecommerce.amazon | scrape_bestsellers, scrape_pricing, scrape_product, scrape_questions, scrape_reviews, scrape_search, scrape_sellers, scrape_url
ecommerce.wayfair | scrape_search, scrape_url
ecommerce.universal | scrape_url

With this information in mind, let’s define a query variable and create a tasks list inside the main() function. This will contain three sources for scraping Amazon, Google Shopping, and Walmart search results:

        query = "adidas hoodie women"
    
        # API payloads for Amazon, Google Shopping, and Walmart.
        tasks = [
            c.ecommerce.amazon.scrape_search(
                query,
            ),
            c.ecommerce.google_shopping.scrape_shopping_search(
                query,
            ),
            c.ecommerce.universal.scrape_url(
                f"https://www.walmart.com/search?q={query.replace(' ', '+')}",
            ),
        ]

Furthermore, let’s add a few more parameters that localize the results for a specific city, parse the raw HTML data, and set the request timeout to 60 seconds along with the request polling interval of 5 seconds:

        query = "adidas hoodie women"
    
        # API payloads for Amazon, Google Shopping, and Walmart.
        tasks = [
            c.ecommerce.amazon.scrape_search(
                query,
                geo_location="62702",
                parse=True,
                timeout=60,
                poll_interval=5,
            ),
            c.ecommerce.google_shopping.scrape_shopping_search(
                query,
                geo_location="Springfield,Illinois,United States",
                parse=True,
                timeout=60,
                poll_interval=5,
            ),
            c.ecommerce.universal.scrape_url(
                f"https://www.walmart.com/search?q={query.replace(' ', '+')}",
                geo_location="Springfield, Illinois",
                parse=True,
                timeout=60,
                poll_interval=5,
            ),
        ]

4. Access results

Once the scraping tasks are ready, iterate over the tasks list and use asyncio.as_completed() to return results as they are completed rather than in the order in which they were started. Inside the for loop, you may also want to use the json module to save each result into a separate JSON file:

        for task in asyncio.as_completed(tasks):
            response = await task
            print(response.raw)
    
            website_name = response.results[0].url.split("www.")[1].split(".com")[0]
            with open(f"sdk_result_{website_name}.json", "w") as f:
                json.dump(response.results[0].content, f, indent=4)
    
    
    if __name__ == "__main__":
        asyncio.run(main())

Oxylabs SDK vs. API

After following the above steps, you should have a Python file that looks like this:

    import asyncio, json
    from oxylabs import AsyncClient
    
    
    async def main():
        c = AsyncClient("YOUR_API_USERNAME", "YOUR_API_PASSWORD")
    
        query = "adidas hoodie women"
    
        tasks = [
            c.ecommerce.amazon.scrape_search(
                query,
                geo_location="62702",
                parse=True,
                timeout=60,
                poll_interval=5,
            ),
            c.ecommerce.google_shopping.scrape_shopping_search(
                query,
                geo_location="Springfield,Illinois,United States",
                parse=True,
                timeout=60,
                poll_interval=5,
            ),
            c.ecommerce.universal.scrape_url(
                f"https://www.walmart.com/search?q={query.replace(' ', '+')}",
                geo_location="Springfield, Illinois",
                parse=True,
                timeout=60,
                poll_interval=5,
            ),
        ]
    
        for task in asyncio.as_completed(tasks):
            response = await task
            print(response.raw)
    
            website_name = response.results[0].url.split("www.")[1].split(".com")[0]
            with open(f"sdk_result_{website_name}.json", "w") as f:
                json.dump(response.results[0].content, f, indent=4)
    
    
    if __name__ == "__main__":
        asyncio.run(main())

Let’s compare it to API code that achieves the same result and implements comparable error handling:

    import asyncio, aiohttp, logging, json
    from aiohttp import ClientSession, BasicAuth
    
    
    # Configuration variables.
    CREDENTIALS = BasicAuth("YOUR_API_USERNAME", "YOUR_API_PASSWORD")
    TIMEOUT = 60
    POLL_INTERVAL = 5
    QUERY = "adidas hoodie women"
    
    # API payloads for Amazon, Google Shopping, and Walmart.
    payloads = [
        {
            "source": "amazon_search",
            "query": QUERY,
            "geo_location": "62702",
            "parse": True,
        },
        {
            "source": "google_shopping_search",
            "query": QUERY,
            "geo_location": "Springfield,Illinois,United States",
            "parse": True,
        },
        {
            "source": "universal_ecommerce",
            "url": f"https://www.walmart.com/search?q={QUERY.replace(' ', '+')}",
            "geo_location": "Springfield, Illinois",
            "parse": True,
        }
    ]
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    async def submit_job(session: ClientSession, payload):
        try:
            async with session.post(
                "https://data.oxylabs.io/v1/queries",
                auth=CREDENTIALS,
                json=payload,
            ) as r:
                r.raise_for_status()
                data = await r.json()
                return data["id"]
            
        except aiohttp.ClientResponseError as e:
            logger.error(
                f"HTTP error while submitting the job: {e.status} - {e.message}"
            )
        except aiohttp.ClientConnectionError as e:
            logger.error(
                f"Connection error: {e}"
            )
        except asyncio.TimeoutError:
            logger.error(
                f"The request has timed out."
            )
        except Exception as e:
            logger.error(
                f"Error: {str(e)}"
            )
        return None
    
    
    async def check_job_status(session: ClientSession, job_id):
        end_time = asyncio.get_event_loop().time() + TIMEOUT
        while asyncio.get_event_loop().time() < end_time:
            try:
                async with session.get(
                    f"https://data.oxylabs.io/v1/queries/{job_id}",
                    auth=CREDENTIALS
                ) as r:
                    r.raise_for_status()
                    data = await r.json()
                    if data["status"] == "done":
                        return True
                    elif data["status"] == "faulted":
                        raise Exception(f"Job {job_id} has faulted.")
                
            except Exception as e:
                logger.error(
                    f"Error: {str(e)}"
                )
                return False
            await asyncio.sleep(POLL_INTERVAL)
        
        logger.info("Timeout exceeded.")
        return False
                
    
    async def get_job_results(session: ClientSession, job_id):
        async with session.get(
            f"https://data.oxylabs.io/v1/queries/{job_id}/results",
            auth=CREDENTIALS
        ) as r:
            try:
                r.raise_for_status()
                data = await r.json()
                return data["results"][0]["content"]
            
            except aiohttp.ClientResponseError as e:
                logger.error(
                    f"HTTP error for Job ID {job_id}: {e.status} - {e.message}"
                )
            except aiohttp.ClientConnectionError as e:
                logger.error(
                    f"Connection error for Job ID {job_id}: {e}"
                )
            except asyncio.TimeoutError:
                logger.error(
                    f"The request has timed out for Job ID {job_id}."
                )
            except Exception as e:
                logger.error(
                    f"Error for Job ID {job_id}: {e}"
                )
            return None
    
    
    async def save_to_json(domain, results):
        with open(f"api_result_{domain}.json", "w") as f:
            json.dump(results, f, indent=4)
    
    
    async def execute(session: ClientSession, payload):
        job_id = await submit_job(session, payload)
        if not job_id:
            logger.error("Failed to get the job ID.")
            return
        await asyncio.sleep(POLL_INTERVAL)
    
        job_completed = await check_job_status(session, job_id)
        if not job_completed:
            logger.error("Job didn't complete successfully.")
            return
    
        results = await get_job_results(session, job_id)
        if not results:
            return
        domain = results["url"].split("www.")[1].split(".com")[0]
        await save_to_json(domain, results)
    
    
    async def main():
        async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=TIMEOUT)) as session:
            tasks = []
            for payload in payloads:
                task = asyncio.ensure_future(execute(session, payload))
                tasks.append(task)
    
            await asyncio.gather(*tasks)
    
    
    if __name__ == "__main__":
        asyncio.run(main())

As you can see, the API code is significantly more complex and requires considerably more development time and effort to manage asynchronous requests to Oxylabs' Web Scraper API, poll for job status, handle errors, and retrieve results. For more details on using Oxylabs solutions, check out Oxylabs documentation and connect with like-minded professionals on Discord.

Summary

SDKs remove much of the friction in software development, letting programmers prioritize app functionality and a seamless user experience while saving time and costs. Likewise, web scraping SDKs take over handling HTTP(S) requests, responses, errors, and scalability, giving developers faster coding while maintaining expert-level functionality and performance.

As such, Oxylabs SDKs for Python and Go are an excellent choice if you want to easily integrate web scraping into your infrastructure through a simplified interface, automated request management, and efficient error handling. Feel free to test Oxylabs SDKs in your projects and claim a free 1-week API trial if you’re new to Oxylabs scraping solutions.

About the author

Vytenis Kaubrė

Technical Copywriter

Vytenis Kaubrė is a Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
