
Web Crawler: Where to Start & How It Works


Augustas Pelakauskas

2023-03-13

3 min read

As an add-on to Oxylabs’ Web Scraper API, Web Crawler allows you to explore and collect data quickly and efficiently using our maintenance-free infrastructure.

What is Web Crawling?

It’s an automated process of discovering target URLs (links). Often, web crawling precedes web scraping to identify (index) target URLs for subsequent data extraction.

Before continuing, make sure to check the differences between web crawling and web scraping.

What is Oxylabs Web Crawler?

Web Crawler is a feature of Oxylabs' Web Scraper API that lets you crawl any website, select useful content, and have it delivered in bulk. The tool can discover all pages on a website and fetch data from them at scale and in real time.

How does Web Crawler work?

Web Crawler allows you to leverage our Web Scraper API scraping and parsing functions to crawl websites using advanced features such as JavaScript rendering.

The process encompasses three major stages:

1. User input. First, specify crawling patterns to define the data you need:

  • Select a starting URL.

  • Define the URLs Web Crawler should visit.

  • Define the URLs holding useful information.

  • Specify data collection preferences, such as geolocation.

2. Data discovery. After receiving your input, Web Crawler traverses the website by following links between pages until it finds no new URLs matching the patterns you specified.

3. Job results. When Web Crawler finishes a task, you can download the results in a specified format as a ready-to-use file via an API or receive them in your chosen cloud storage bucket.

Web Crawler’s action chain

For a visual breakdown of crawling an e-commerce website, check the video below.

Job setup

The following is a quick overview of what to expect when working with Web Crawler. You can find all the endpoints, parameters, and values with in-depth explanations in our documentation.

Endpoints

With an API client, such as Postman, you can use endpoints to communicate with our APIs. Web Crawler has a number of endpoints you can use to control the process:

  • Initiate, stop, or resume a job.

  • Get job information.

  • Get the list of URLs found while crawling.

  • Get the results.
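To illustrate, here is a minimal Python sketch of driving these endpoints with the requests library. The base URL, endpoint paths, response fields, and credentials below are placeholders rather than exact values; our documentation lists the actual routes.

import requests

BASE_URL = "https://crawler.example.oxylabs.io/v1/jobs"  # placeholder endpoint; see the documentation for the real one
AUTH = ("YOUR_USERNAME", "YOUR_PASSWORD")                # API credentials (Basic Auth)

# Initiate a crawling job with a payload describing the crawl
# (see the parameter sections below for what goes into it).
payload = {"url": "https://www.example.com"}
job = requests.post(BASE_URL, auth=AUTH, json=payload).json()
job_id = job["id"]  # assumed response field holding the job ID

# Get job information to check whether the crawl has finished.
info = requests.get(f"{BASE_URL}/{job_id}", auth=AUTH).json()

# Get the list of URLs found while crawling.
sitemap = requests.get(f"{BASE_URL}/{job_id}/sitemap", auth=AUTH).json()

# Get the results once the job is done.
results = requests.get(f"{BASE_URL}/{job_id}/aggregate", auth=AUTH).json()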

Crawling parameters

You can provide a set of parameters that determine the crawling scope. Choose your starting URL (target) and determine the breadth and depth of the crawling process using filters. Filters also manage the inclusion of URLs in the end result.
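As a sketch of what such a scope definition could look like, the payload below pairs a starting URL with regex-based filters. The field names (filters, crawl, process, max_depth) are illustrative assumptions; check the documentation for the exact parameter names.

# Illustrative crawling-scope payload; field names are assumptions.
payload = {
    # Starting URL (target) for the crawl.
    "url": "https://www.example-shop.com",
    "filters": {
        # URL patterns Web Crawler should visit while traversing the site.
        "crawl": [".*/category/.*"],
        # URL patterns holding useful information, i.e. included in the end result.
        "process": [".*/product/.*"],
        # How many links deep to follow from the starting URL (breadth and depth control).
        "max_depth": 3,
    },
}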

Scraping parameters

After defining the parameters that control how Web Crawler treats the URLs found while traversing the site, you may want to add scraping parameters to fine-tune how the scraping jobs themselves are performed.

NOTE: For example, if you're crawling Amazon, you may want to add parameters specific to our Amazon product data API. Make sure to check the respective scraper documentation for more details.

For instance, you may want to execute JavaScript while crawling a site or target a domain from a particular geolocation.

To wrap up the setup, specify your final output parameter to determine the format of your result (see the Job results section).
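Continuing the sketch from the crawling parameters above, the snippet below adds illustrative scraping and output settings: JavaScript rendering, a geolocation, and the result format. Again, the field names (scrape_params, render, geo_location, output) are assumptions to be checked against the documentation.

# Illustrative scraping and output parameters added to the same payload.
payload["scrape_params"] = {
    "render": "html",                  # execute JavaScript while crawling
    "geo_location": "United States",   # target the domain from a particular geolocation
}
# Result format: a URL list (sitemap), parsed results, or HTML results
# (see the Job results section below).
payload["output"] = {"type_": "sitemap"}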

Job results

Web Crawler aggregates the results into one or more result files (chunks) as a final output ready to use. The three result types are:

  1. List of URLs (sitemap).

  2. JSON file containing an aggregate of parsed results.

  3. JSON file containing an aggregate of HTML results.

NOTE: If you choose the first option, you can download the contents of any URL on the list. The URLs are associated with scraping job IDs that you can use to fetch the scraping results. See our documentation for more details.
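As a rough illustration of fetching the final output, the sketch below downloads each result file (chunk) for a finished job. The chunk route and the field holding the chunk count are assumptions; the documentation describes the actual endpoints.

import requests

BASE_URL = "https://crawler.example.oxylabs.io/v1/jobs"  # placeholder endpoint
AUTH = ("YOUR_USERNAME", "YOUR_PASSWORD")

job_id = "1234567890"  # hypothetical ID of a finished job
info = requests.get(f"{BASE_URL}/{job_id}", auth=AUTH).json()

# Results may be split into several files (chunks); download each one.
for chunk in range(1, info.get("chunks", 1) + 1):
    resp = requests.get(f"{BASE_URL}/{job_id}/aggregate/{chunk}", auth=AUTH)
    with open(f"results_chunk_{chunk}.json", "wb") as f:
        f.write(resp.content)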

Uploading to cloud (optional)

Once the job is done, you can download the results from us or get them delivered to your cloud storage. Specify the exact location, and Web Crawler will upload the file(s) to AWS S3.
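As a hedged sketch, pointing the job at your own bucket could look like the snippet below; the storage field and its keys are assumptions, and your S3 bucket will typically need a policy that allows the upload.

# Illustrative delivery settings added to the job payload; field names are assumptions.
payload["storage"] = {
    "type": "s3",
    # Exact bucket location where Web Crawler should upload the result file(s).
    "url": "s3://your-bucket-name/crawler-results/",
}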

Wrapping up

Easy to set up and highly customizable, Web Crawler performs URL discovery by moving through a website using links found on pages it scrapes. Test the Web Crawler’s functionality with a one-week free trial of the Web Scraper API.

If you need guidance in setting up and customizing your crawling tasks with Web Crawler, feel free to contact us via the 24/7 live chat on our home page or send us an email.

We advise you to seek legal consultation before engaging in any kind of scraping activities in order to assess a specific situation and get an expert opinion on further proceedings.

About the author

Augustas Pelakauskas

Senior Copywriter

Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
