The internet is bustling with layers upon layers of publicly available data. How do you make sense of it all? One answer is to collect that data with automated tools, such as the WebHarvy web scraper, and then store and analyze it on your own systems.
WebHarvy is a web scraping tool that extracts text, HTML, and images from web pages. The tool handles logins, form submissions, navigation, pagination, and scheduled scraping, and it supports proxies.
Follow the tutorial below to learn how to integrate Oxylabs Residential Proxies and start scraping with WebHarvy.
The tool offers easy-to-use third-party proxy support. You can use either a single proxy or a list of proxy servers for public web data collection. Avoid free or open proxy services, as they are likely to cut out in the middle of an operation.
1. First, download and install the WebHarvy app via webharvy.com.
2. Once set up, navigate to Settings.
Navigating to settings
3. Click on Proxy Settings. Check Enable network connection via Proxy Server and choose HTTP as your Type.
4. Fill in the required credentials. Under Address, enter pr.oxylabs.io, and under Port, type in 7777. You can also use country-specific entries. For example, if you fill in us-pr.oxylabs.io under Address and 10000 under Port, you’ll acquire a US exit node. For a complete list of country-specific entry nodes or if you need a sticky session, please refer to our documentation.
5. Check Requires authentication and enter your Oxylabs sub-user’s Username and Password. Click on the + button to add the newly entered proxy to the list. Lastly, press Apply to finish the WebHarvy proxy integration.
And that’s all. With proxies in place, WebHarvy can scrape data more anonymously and with a lower risk of being blocked.
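If you want to check the same endpoint and credentials outside of WebHarvy, the settings above can be expressed as a proxy URL. The following is a minimal Python sketch using only the standard library; the function names (`build_proxy_url`, `make_opener`, `fetch_status`) are illustrative and not part of WebHarvy or any Oxylabs tooling, and the username and password are placeholders for your own sub-user credentials.

```python
import urllib.request
from urllib.parse import quote

# Entry point and port taken from step 4 above.
PROXY_HOST = "pr.oxylabs.io"
PROXY_PORT = 7777


def build_proxy_url(username: str, password: str,
                    host: str = PROXY_HOST, port: int = PROXY_PORT) -> str:
    """Return an HTTP proxy URL with percent-encoded credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


def make_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS requests via the proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(handler)


def fetch_status(url: str, proxy_url: str, timeout: float = 30.0) -> int:
    """Request `url` through the proxy and return the HTTP status code."""
    opener = make_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.status
```

Calling `fetch_status("https://books.toscrape.com/", build_proxy_url("USERNAME", "PASSWORD"))` with real credentials should return a status code if the proxy accepts them. For the country-specific entry mentioned in step 4, pass `host="us-pr.oxylabs.io"` and `port=10000` to `build_proxy_url`.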
1. To begin, navigate to a target website. In this case, the book listings at https://books.toscrape.com/.
2. Press Start to begin target data selection.
Initiating data selection
3. Select the desired attributes, for example, book titles and prices. WebHarvy’s built-in browser lets you click on the specific content you want to scrape. As you select items, the tool detects data patterns that occur on the webpage. If the data repeats, the tool scrapes it automatically without additional user input.
Selecting chunks of content
4. Choose Capture Text and name your items accordingly.
Capturing target data
5. After selecting data to be scraped, press Stop to finish the configuration.
Finishing the configuration
6. Click Start-Mine and press ▶Start to extract your data.
7. After the extraction process is over, click Export and select the export method. WebHarvy saves scraped data in Excel, XML, CSV, JSON, and TSV formats. Alternatively, you can export the data to a database.
And that’s it. Here’s the final result – a spreadsheet with book titles and prices.
The final result
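For readers curious about what the point-and-click configuration amounts to, here is a rough Python sketch of the same idea: find a repeating block, pull a title and a price from each one, and write the rows out as CSV. It is not how WebHarvy works internally. The sample markup is an assumption modelled on a typical book listing page, and the class names (`product_pod`, `price_color`) and the `title` and `price` column names are illustrative only.

```python
import csv
import io
from html.parser import HTMLParser

# Assumed sample markup; the real page structure may differ.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="book-1/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="book-2/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""


class BookParser(HTMLParser):
    """Collect (title, price) pairs from repeated listing blocks."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._title = None
        self._in_heading = False
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            self._in_heading = True
        elif tag == "a" and self._in_heading:
            # The full title lives in the link's title attribute.
            self._title = attrs.get("title")
        elif tag == "p" and "price_color" in (attrs.get("class") or "").split():
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_heading = False
        elif tag == "p":
            self._in_price = False

    def handle_data(self, data):
        # Pair the price text with the most recently seen title.
        if self._in_price and self._title is not None:
            self.rows.append((self._title, data.strip()))
            self._title = None


def extract_books(html):
    """Return a list of (title, price) tuples found in `html`."""
    parser = BookParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "price"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Running `to_csv(extract_books(SAMPLE_HTML))` yields a header line followed by one line per book, which is roughly the shape of the spreadsheet shown above.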
Web scraping is a crucial part of modern data-gathering solutions. WebHarvy is a code-free tool that can help you scale your daily data collection. As the tool accepts various third-party proxies, be sure to employ a reliable proxy service provider.
If you have any questions about configuring our proxies or are considering our public web scraping solutions, don’t hesitate to get in touch with us for more information.
Please be aware that this is a third-party tool not owned or controlled by Oxylabs. Each third-party provider is responsible for its own software and services. Consequently, Oxylabs will have no liability or responsibility to you regarding those services. Please carefully review the third party's policies and practices and/or conduct due diligence before accessing or using third-party services.
Is WebHarvy free?
WebHarvy is shareware and offers a 15-day free trial.
Does WebHarvy support RegEx?
Yes, Regular Expressions can be used to scrape target data more accurately. You can apply Regular Expressions to selected text and HTML before extraction.
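As an illustration of the kind of cleanup a Regular Expression can do, the short Python sketch below pulls the numeric amount out of a captured price string such as "£51.77". The pattern and the `extract_amount` helper are hypothetical examples, and WebHarvy’s own RegEx syntax may differ in details from Python’s `re` module.

```python
import re

# Match an integer or decimal number anywhere in the text.
PRICE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)")


def extract_amount(text):
    """Return the first number in `text` as a float, or None if absent."""
    match = PRICE_PATTERN.search(text)
    return float(match.group(1)) if match else None
```

Applied to a captured price, `extract_amount("£51.77")` drops the currency symbol and keeps only the number, while a string with no digits returns `None`.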